Abstract

In the stochastic knapsack problem, we are given a knapsack of size B, and a set of jobs whose sizes and 
rewards are drawn from a known probability distribution. However, the only way to know the actual size and 
reward is to schedule the job — when it completes, we get to know these values. How should we schedule jobs 
to maximize the expected total reward? We know constant-factor approximations for this problem when we 
assume that rewards and sizes are independent random variables, and that we cannot prematurely cancel jobs 
after we schedule them. What can we say when either or both of these assumptions are changed? 

The stochastic knapsack problem is of interest in its own right, but techniques developed for it are applicable to other stochastic packing problems. Indeed, ideas for this problem have been useful for budgeted learning problems, where one is given several arms which evolve in a specified stochastic fashion with each pull, and the goal is to pull the arms a total of B times to maximize the reward obtained. Much recent work on this problem focuses on the case when the evolution of the arms follows a martingale, i.e., when the expected reward from the future is the same as the reward at the current state. What can we say when the rewards do not form a martingale?

In this paper, we give constant-factor approximation algorithms for the stochastic knapsack problem with correlations and/or cancellations, and also for budgeted learning problems where the martingale condition is not satisfied, using similar ideas. Indeed, we can show that previously proposed linear programming relaxations for these problems have large integrality gaps. We propose new time-indexed LP relaxations; using a decomposition and "gap-filling" approach, we convert these fractional solutions to distributions over strategies, and then use the LP values and the time ordering information from these strategies to devise a randomized adaptive scheduling algorithm. We hope our LP formulation and decomposition methods may provide a new way to address other correlated bandit problems with more general contexts.
1 Introduction



Stochastic packing problems seem to be conceptually harder than their deterministic counterparts — imagine a 
situation where some rounding algorithm outputs a solution in which the budget constraint has been exceeded 
by a constant factor. For deterministic packing problems (with a single constraint), one can now simply pick 
the most profitable subset of the items which meets the packing constraint; this would give us a profit within a 
constant of the optimal value. The deterministic packing problems not well understood are those with multiple 
(potentially conflicting) packing constraints. 

However, for the stochastic problems, even a single packing constraint is not simple to handle. Even though they arise in diverse situations, the first study from an approximations perspective was in an important paper of Dean et al. [DGV08] (see also [DGV05, Dea05]). They defined the stochastic knapsack problem, where each job has a random size and a random reward, and the goal is to give an adaptive strategy for irrevocably picking jobs in order to maximize the expected value of those fitting into a knapsack with size B. They gave an LP relaxation and rounding algorithm, which produced non-adaptive solutions whose performance was surprisingly within a constant factor of the best adaptive ones (resulting in a constant adaptivity gap, a notion they also introduced). However, the results required that (a) the random rewards and sizes for items were independent of each other, and (b) once a job was placed, it could not be prematurely canceled; it is easy to see that these assumptions change the nature of the problem significantly.

The study of the stochastic knapsack problem was very influential; in particular, the ideas here were used to obtain approximation algorithms for budgeted learning problems studied by Guha and Munagala [GM07b, GM07a, GM09] and Goel et al. [GKN09], among others. They considered problems in the multi-armed bandit setting with k arms, each arm evolving according to an underlying state machine with probabilistic transitions when pulled. Given a budget B, the goal is to pull arms up to B times to maximize the reward; payoffs are associated with states, and the reward is some function of payoffs of the states seen during the evolution of the algorithm. (E.g., it could be the sum of the payoffs of all states seen, or the reward of the best final state, etc.) The above papers gave O(1)-approximations, index-based policies and adaptivity gaps for several budgeted learning problems. However, these results all required the assumption that the rewards satisfied a martingale property, namely, if an arm is in some state u, one pull of this arm would bring an expected payoff equal to the payoff of state u itself. The motivation for such an assumption comes from the fact that the different arms are assumed to be associated with a fixed (but unknown) reward, but we only begin with a prior distribution of possible rewards. Then, the expected reward from the next pull of the arm, conditioned on the previous pulls, forms a Doob martingale.

However, there are natural instances where the martingale property need not hold. For instance, the evolution of the prior could depend not just on the observations made but on external factors (such as time) as well. Or, in a marketing application, the evolution of a customer's state may require repeated "pulls" (or marketing actions) before the customer transitions to a high reward state and makes a purchase, while the intermediate states may not yield any reward. These lead us to consider the following problem: there is a collection of n arms, each characterized by an arbitrary (known) Markov chain, and there are rewards associated with the different states. When we play an arm, it makes a state transition according to the associated Markov chain, and fetches the corresponding reward of the new state. What should our strategy be in order to maximize the expected total reward we can accrue by making at most B pulls in total?
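
To make the arm model concrete, the following is a minimal Python sketch, not taken from the paper, of a single Markov-chain arm. The dictionary layout, the function names `expected_reward` and `is_martingale`, and the toy `customer` chain (modelled on the marketing example above) are all illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Assumed layout: transitions[u] lists (next_state, probability) pairs, and
# payoff[u] is the reward fetched when a pull moves the arm into state u.
Transitions = Dict[str, List[Tuple[str, float]]]


def expected_reward(transitions: Transitions, payoff: Dict[str, float],
                    start: str, pulls: int) -> float:
    """Expected total reward of pulling this one arm `pulls` times from `start`."""
    if pulls == 0:
        return 0.0
    return sum(p * (payoff[v] + expected_reward(transitions, payoff, v, pulls - 1))
               for v, p in transitions[start])


def is_martingale(transitions: Transitions, payoff: Dict[str, float],
                  tol: float = 1e-9) -> bool:
    """True if every state's payoff equals the expected payoff after one pull."""
    return all(abs(payoff[u] - sum(p * payoff[v] for v, p in nxt)) <= tol
               for u, nxt in transitions.items())


# Toy marketing arm: two pulls without reward, a purchase on the third pull,
# and an absorbing zero-reward state afterwards.
customer = {
    "new": [("contacted", 1.0)],
    "contacted": [("interested", 1.0)],
    "interested": [("bought", 1.0)],
    "bought": [("done", 1.0)],
    "done": [("done", 1.0)],
}
customer_payoff = {"new": 0.0, "contacted": 0.0, "interested": 0.0,
                   "bought": 1.0, "done": 0.0}
```

On this toy chain, two pulls earn nothing while three pulls earn 1, and `is_martingale` returns `False`: stopping an arm early can forfeit all of its reward, which is exactly the situation the martingale assumption rules out.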

1.1 Results 

Our main results are the following: We give the first constant-factor approximations for the general version of the stochastic knapsack problem where rewards could be correlated with the sizes. Our techniques are general and also apply to the setting when jobs could be canceled arbitrarily. We then extend those ideas to give the first constant-factor approximation algorithms for a class of budgeted learning problems with Markovian transitions where the martingale property is not satisfied. We summarize these in Table 1.

| Problem             | Restrictions                        | Paper     |
|---------------------|-------------------------------------|-----------|
| Stochastic Knapsack | Fixed Rewards, No Cancellation      | [DGV05]   |
|                     | Correlated Rewards, No Cancellation | Section 2 |
|                     | Correlated Rewards, Cancellation    | Section 3 |
| Multi-Armed Bandits | Martingale Assumption               | [GM07b]   |
|                     | No Martingale Assumption            | Section 4 |

Table 1: Summary of Results



1.2 Why Previous Ideas Don't Extend, and Our Techniques 



One reason why stochastic packing problems are more difficult than their deterministic counterparts is that, unlike in the deterministic setting, here we cannot simply take a solution with expected reward R* that packs into a knapsack of size 2B and convert it (by picking a subset of the items) into a solution which obtains a constant fraction of the reward R* whilst packing into a knapsack of size B. In fact, there are examples where a budget of 2B can fetch much more reward than what a budget of size B can (see Appendix A.2). Another distinction from deterministic problems is that allowing cancellations can drastically increase the value of the solution (see Appendix A.1). The model used in previous works on stochastic knapsack and on budgeted learning circumvented both issues; in contrast, our model forces us to address them.



Stochastic Knapsack: Dean et al. [DGV08, Dea05] assume that the reward/profit of an item is independent of its stochastic size. Moreover, their model does not consider the possibility of canceling jobs in the middle. These assumptions simplify the structure of the decision tree and make it possible to formulate a (deterministic) knapsack-style LP, and round it. However, as shown in Appendix A, their LP relaxation performs poorly when either correlation or cancellation is allowed. This is the first issue we need to address.

Budgeted Learning: Obtaining approximations for budgeted learning problems is a more complicated task, since cancellations may be inherent in the problem formulation, i.e., any strategy would stop playing a particular arm and switch to another, and the rewards by playing any arm are naturally correlated with the (current) state and hence the number of previous pulls made on the item/arm. The first issue is often tackled by using more elaborate LPs with a flow-like structure that compute a probability distribution over the different times at which the LP stops playing an arm (e.g., [GM07a]), but the latter issue is less understood. Indeed, several papers on this topic present strategies that fetch an expected reward which is a constant factor of an optimal solution's reward, but which may violate the budget by a constant factor. In order to obtain an approximate solution without violating the budget, they critically make use of the martingale property: with this assumption at hand, they can truncate the last arm played to fit the budget without incurring any loss in expected reward. However, such an idea fails when the martingale property is not satisfied, and these LPs now have large integrality gaps (see Appendix A.2).

At a high level, a major drawback with previous LP relaxations for both problems is that the constraints are local for each arm/job, i.e., they track the probability distribution over how long each item/arm is processed (either till completion or cancellation), and there is an additional global constraint binding the total number of pulls/total size across items. This results in two different issues. For the (correlated) stochastic knapsack problem, these LPs do not capture the case when all the items have high contention, since they want to play early in order to collect profit. And for the general multi-armed bandit problem, we show that no local LP can be good since such LPs do not capture the notion of preempting an arm, namely switching from one arm to another, and possibly returning to the original arm later. Indeed, we show cases when any near-optimal strategy must switch between different arms (see Appendix A.3); this is a major difference from previous work with the martingale property where there exist near-optimal strategies that never return to any arm [GM09, Lemma 2.1]. At a high level, the lack of the martingale property means our algorithm needs to make adaptive decisions, where each move is a function of the previous outcomes; in particular this may involve revisiting a particular arm several times, with interruptions in the middle.

We resolve these issues in the following manner: incorporating cancellations into stochastic knapsack can be handled by just adapting the flow-like LPs from the multi-armed bandits case. To resolve the problems of contention and preemption, we formulate a global time-indexed relaxation that forces the LP solution to commit each job to begin at a time, and places constraints on the maximum expected reward that can be obtained if the algorithm begins an item at a particular time. Furthermore, the time-indexing also enables our rounding scheme to extract information about when to preempt an arm and when to re-visit it based on the LP solution; in fact, these decisions will possibly be different for different (random) outcomes of any pull, but the LP encodes the information for each possibility. We believe that our rounding approach may be of interest in other applications in stochastic optimization.

Another important version of budgeted learning is when we are allowed to make up to B plays as usual but now we can "exploit" at most K times: reward is only fetched when an arm is exploited and again depends on its current state. There is a further constraint that once an arm is exploited, it must then be discarded. The LP-based approach here can be easily extended to that case as well.

1.3 Roadmap 



We begin in Section 2 by presenting a constant-factor approximation algorithm for the stochastic knapsack problem (StocK) when rewards could be correlated with the sizes, but decisions are irrevocable, i.e., job cancellations are not allowed. Then, we build on these ideas in Section 3, and present our results for the (correlated) stochastic knapsack problem, where job cancellation is allowed.

In Section 4 we move on to the more general class of multi-armed bandit (MAB) problems. For clarity in exposition, we present our algorithm for MAB, assuming that the transition graph for each arm is an arborescence (i.e., a directed tree), and then generalize it to arbitrary transition graphs in Section 5.



We remark that while our LP-based approach for the budgeted learning problem implies approximation algorithms for the stochastic knapsack problem as well, the knapsack problem provides a gentler introduction to the issues: it motivates and gives insight into our techniques for MAB. Similarly, it is easier to understand our techniques for the MAB problem when the transition graph of each arm's Markov chain is a tree. Several illustrative examples are presented in Appendix A, e.g., illustrating why we need adaptive strategies for the non-martingale MAB problems, and why some natural ideas do not work. Finally, the extension of our algorithm for MAB for the case when rewards are available only when the arms are explicitly exploited, with budgets on both the exploration and exploitation pulls, appears in Appendix F. Note that this algorithm strictly generalizes the previous work on budgeted learning for MAB with the martingale property [GM07a].



1.4 Related Work 



Stochastic scheduling problems have been studied since the 1960s (e.g., [BL97, Pin95]); however, there are fewer papers on approximation algorithms for such problems. Kleinberg et al. [KRT00], and Goel and Indyk [GI99] consider stochastic knapsack problems with chance constraints: find the max-profit set which will overflow the knapsack with probability at most p. However, their results hold for deterministic profits and specific size distributions. Approximation algorithms for minimizing average completion times with arbitrary job-size distributions were studied by [MSU99, SU01]. The work most relevant to us is that of Dean, Goemans and Vondrak [DGV08, DGV05, Dea05] on stochastic knapsack and packing; apart from algorithms (for independent rewards and sizes), they show the problem to be PSPACE-hard when correlations are allowed. [CR06] study stochastic flow problems. Recent work of Bhalgat et al. [BGK11] presents a PTAS but violates the capacity by a factor (1 + ε); they also get better constant-factor approximations without violations.

The general area of learning with costs is a rich and diverse one (see, e.g., [Ber05, Git89]). Approximation algorithms start with the work of Guha and Munagala [GM07a], who gave LP-rounding algorithms for some problems. Further papers by these authors [GMS07, GM09] and by Goel et al. [GKN09] give improvements, relate LP-based techniques and index-based policies and also give new index policies. (See also [GGM06, GM07b].) [GM09] considers switching costs, [GMP11] allows pulling many arms simultaneously, or when there is delayed feedback. All these papers assume the martingale condition.



2 The Correlated Stochastic Knapsack without Cancellation 

We begin by considering the stochastic knapsack problem (StocK), when a job's reward may be correlated with its size. This generalizes the problem studied by Dean et al. [DGV05] who assume that the rewards are independent of the size of the job. We first explain why the LP of [DGV05] has a large integrality gap for our problem; this will naturally motivate our time-indexed formulation. We then present a simple randomized rounding algorithm which produces a non-adaptive strategy and show that it is an O(1)-approximation.

2.1 Problem Definitions and Notation 

We are given a knapsack of total budget B and a collection of n stochastic items. For any item $i \in [1, n]$, we are given a probability distribution over (size, reward) pairs specified as follows: for each integer value of $t \in [1, B]$, the tuple $(\pi_{i,t}, R_{i,t})$ denotes the probability $\pi_{i,t}$ that item $i$ has a size $t$, and the corresponding reward is $R_{i,t}$. Note that the reward for a job is now correlated to its size; however, these quantities for two different jobs are still independent of each other.

An algorithm to adaptively process these items faces one of the following situations at the end of each timestep: (i) an item may complete at a certain size, giving us the corresponding reward, and the algorithm may choose a new item to start processing, or (ii) the knapsack becomes full, at which point the algorithm cannot process any more items, and any currently running job does not accrue any reward. The objective is to maximize the total expected reward obtained from all completed items. Notice that we do not allow the algorithm to cancel an item before it completes. We relax this requirement in Section 3.



2.2 LP Relaxation 



The LP relaxation in [DGV05] was (essentially) a knapsack LP where the sizes of items are replaced by the expected sizes, and the rewards are replaced by the expected rewards. While this was sufficient when an item's reward is fixed (or chosen randomly but independent of its size), we give an example in Appendix A.2 where such an LP (and in fact, the class of more general LPs used for approximating MAB problems) would have a large integrality gap. As mentioned in Section 1.2, the reason why local LPs don't work is that there could be high contention for being scheduled early (i.e., there could be a large number of items which all fetch reward if they instantiate to a large size, but these events occur with low probability). In order to capture this contention, we write a global time-indexed LP relaxation.

The variable $x_{i,t} \in [0, 1]$ indicates that item $i$ is scheduled at (global) time $t$; $S_i$ denotes the random variable for the size of item $i$, and $\mathsf{ER}_{i,t} = \sum_{s \le B-t} \pi_{i,s} R_{i,s}$ captures the expected reward that can be obtained from item $i$ if it begins at time $t$ (no reward is obtained for sizes that cannot fit the remaining budget).
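
As a small illustration of the two quantities that drive the LP, here is a hedged Python sketch. The dictionary representation of an item and the function names are assumptions made for this example only.

```python
from typing import Dict, Tuple

# Assumed layout for one item i: size t -> (probability pi_{i,t}, reward R_{i,t}).
Item = Dict[int, Tuple[float, float]]


def expected_truncated_size(item: Item, t: int) -> float:
    """E[min(S_i, t)]: the expected size of the item, truncated at t."""
    return sum(p * min(s, t) for s, (p, _) in item.items())


def expected_reward_if_started(item: Item, t: int, B: int) -> float:
    """ER_{i,t}: expected reward when the item begins at time t.

    Only sizes s <= B - t fit into the remaining budget and earn reward.
    """
    return sum(p * r for s, (p, r) in item.items() if s <= B - t)


# An item whose reward is correlated with its size: size 1 pays nothing,
# size 4 pays 10, each with probability 1/2.
item = {1: (0.5, 0.0), 4: (0.5, 10.0)}
```

With a budget of `B = 4`, this item has `ER = 5.0` if started at time 0 but `ER = 0.0` if started at time 1, which is the kind of contention for early start times that the time-indexed LP is meant to capture.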



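\[
\begin{aligned}
\max\ & \textstyle\sum_{i,t} \mathsf{ER}_{i,t} \cdot x_{i,t} && && (\mathsf{LP}_{\mathsf{NoCancel}}) \\
& \textstyle\sum_{t} x_{i,t} \le 1 && \forall i && (2.1) \\
& \textstyle\sum_{i,\, t' \le t} x_{i,t'} \cdot \mathbb{E}[\min(S_i, t)] \le 2t && \forall t \in [B] && (2.2) \\
& x_{i,t} \in [0, 1] && \forall t \in [B],\ \forall i && (2.3)
\end{aligned}
\]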



While the size of the above LP (and the running time of the rounding algorithm below) depend polynomially on B, i.e., are pseudo-polynomial, it is possible to write a compact (approximate) LP and then round it; details on the polynomial time implementation appear in Appendix B.2.



Notice the constraints involving the truncated random variables in equation (2.2): these are crucial for showing the correctness of the rounding algorithm StocK-NoCancel. Furthermore, the ideas used here will reappear in the MAB algorithm later; for MAB, even though we can't explicitly enforce such a constraint in the LP, we will end up inferring a similar family of inequalities from a near-optimal LP solution.
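
The sketch below, which is illustrative and not the paper's implementation, checks a candidate fractional solution against the constraints of the time-indexed LP and evaluates its objective in plain Python. The data layout, the function names, and the convention that start times are the keys of each `x[i]` are assumptions.

```python
from typing import Dict, List, Tuple

Item = Dict[int, Tuple[float, float]]   # assumed: size -> (probability, reward)
Solution = List[Dict[int, float]]       # assumed: x[i][t] = value of x_{i,t}


def truncated_mean(item: Item, t: int) -> float:
    """E[min(S_i, t)]."""
    return sum(p * min(s, t) for s, (p, _) in item.items())


def lp_value(items: List[Item], x: Solution, B: int) -> float:
    """Objective: sum over i, t of ER_{i,t} * x_{i,t}."""
    return sum(xt * sum(p * r for s, (p, r) in item.items() if s <= B - t)
               for item, xi in zip(items, x) for t, xt in xi.items())


def is_feasible(items: List[Item], x: Solution, B: int, tol: float = 1e-9) -> bool:
    # (2.3): every variable lies in [0, 1].
    if any(not (-tol <= xt <= 1 + tol) for xi in x for xt in xi.values()):
        return False
    # (2.1): each item is (fractionally) started at most once.
    if any(sum(xi.values()) > 1 + tol for xi in x):
        return False
    # (2.2): the truncated load of items started by time t is at most 2t.
    for t in range(1, B + 1):
        load = sum(xt * truncated_mean(item, t)
                   for item, xi in zip(items, x)
                   for tp, xt in xi.items() if tp <= t)
        if load > 2 * t + tol:
            return False
    return True
```

For instance, three unit-size items started at times 0, 1 and 2 pass the check with budget 3, while starting all three at time 0 violates constraint (2.2) at t = 1, since the truncated load 3 exceeds 2t = 2.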



Lemma 2.1 The above relaxation is valid for the StocK problem when cancellations are not permitted, and has objective value LPOpt ≥ Opt, where Opt is the expected profit of an optimal adaptive policy.

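Proof. Consider an optimal policy Opt and let $x^*_{i,t}$ denote the probability that item $i$ is scheduled at time $t$. We first show that $\{x^*\}$ is a feasible solution for the LP relaxation $\mathsf{LP}_{\mathsf{NoCancel}}$. It is easy to see that constraints (2.1) and (2.3) are satisfied. To prove that constraints (2.2) are also satisfied, consider some $t \in [B]$ and some run (over random choices of item sizes) of the optimal policy. Let $\mathbf{1}^{\mathrm{sched}}_{i,t'}$ be the indicator variable that item $i$ is scheduled at time $t'$ and let $\mathbf{1}^{\mathrm{size}}_{i,s}$ be the indicator variable for whether the size of item $i$ is $s$. Also, let $L_t$ be the random variable indicating the last item scheduled at or before time $t$. Notice that $L_t$ is the only item scheduled before or at time $t$ whose execution may go over time $t$. Therefore, we get that

\[ \sum_{i \ne L_t} \sum_{t' \le t} \sum_{s \le B} \mathbf{1}^{\mathrm{sched}}_{i,t'} \cdot \mathbf{1}^{\mathrm{size}}_{i,s} \cdot s \le t. \]

Including $L_t$ in the summation and truncating the sizes by $t$, we immediately obtain

\[ \sum_{i} \sum_{t' \le t} \sum_{s} \mathbf{1}^{\mathrm{sched}}_{i,t'} \cdot \mathbf{1}^{\mathrm{size}}_{i,s} \cdot \min(s, t) \le 2t. \]

Now, taking expectation (over all of Opt's sample paths) on both sides and using linearity of expectation we have

\[ \sum_{i} \sum_{t' \le t} \sum_{s} \mathbb{E}\big[\mathbf{1}^{\mathrm{sched}}_{i,t'} \cdot \mathbf{1}^{\mathrm{size}}_{i,s}\big] \cdot \min(s, t) \le 2t. \]

However, because Opt decides whether to schedule an item before observing the size it instantiates to, we have that $\mathbf{1}^{\mathrm{sched}}_{i,t'}$ and $\mathbf{1}^{\mathrm{size}}_{i,s}$ are independent random variables; hence, the LHS above can be re-written as

\[
\begin{aligned}
& \sum_{i} \sum_{t' \le t} \sum_{s} \Pr\big[\mathbf{1}^{\mathrm{sched}}_{i,t'} = 1 \wedge \mathbf{1}^{\mathrm{size}}_{i,s} = 1\big] \cdot \min(s, t) \\
&\quad = \sum_{i} \sum_{t' \le t} \Pr\big[\mathbf{1}^{\mathrm{sched}}_{i,t'} = 1\big] \sum_{s} \Pr\big[\mathbf{1}^{\mathrm{size}}_{i,s} = 1\big] \cdot \min(s, t) \\
&\quad = \sum_{i} \sum_{t' \le t} x^*_{i,t'} \cdot \mathbb{E}[\min(S_i, t)].
\end{aligned}
\]

Hence constraints (2.2) are satisfied. Now we argue that the expected reward of Opt is equal to the value of the solution $x^*$. Let $O_i$ be the random variable denoting the reward obtained by Opt from item $i$. Again, due to the independence between Opt scheduling an item and the size it instantiates to, we get that the expected reward that Opt gets from executing item $i$ at time $t$ is

\[ \mathbb{E}\big[O_i \mid \mathbf{1}^{\mathrm{sched}}_{i,t} = 1\big] = \sum_{s \le B-t} \pi_{i,s} R_{i,s} = \mathsf{ER}_{i,t}. \]

Thus the expected reward from item $i$ is obtained by considering all possible starting times for $i$:

\[ \mathbb{E}[O_i] = \sum_{t} \Pr\big[\mathbf{1}^{\mathrm{sched}}_{i,t} = 1\big] \cdot \mathbb{E}\big[O_i \mid \mathbf{1}^{\mathrm{sched}}_{i,t} = 1\big] = \sum_{t} \mathsf{ER}_{i,t} \cdot x^*_{i,t}. \]

This shows that $\mathsf{LP}_{\mathsf{NoCancel}}$ is a valid relaxation for our problem and completes the proof of the lemma. ■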

Algorithm 2.1 Algorithm StocK-NoCancel

1: for each item i, assign a random start-time $D_i = t$ with probability $x^*_{i,t}/4$; with probability $1 - \sum_t x^*_{i,t}/4$, completely ignore item i ($D_i = \infty$ in this case).
2: for j from 1 to n do
3:   Consider the item i which has the jth smallest deadline (and $D_i \ne \infty$)
4:   if the items added so far to the knapsack occupy at most $D_i$ space then
5:     add i to the knapsack.



We are now ready to present our rounding algorithm StocK-NoCancel (Algorithm 2.1). It is a simple randomized rounding procedure which (i) picks the start time of each item according to the corresponding distribution in the optimal LP solution, and (ii) plays the items in order of the (random) start times. To ensure that the budget is not violated, we also drop each item independently with some constant probability.
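
The rounding procedure just described can be sketched in Python as follows. This is a simulation under stated assumptions rather than the authors' code: the realized `sizes` and `rewards` are passed in only so that one run can be evaluated (the policy commits to an item before looking at them), `uniform` is a hypothetical hook for the source of randomness, and reward is credited whenever an added item finishes within the budget B.

```python
import math
import random
from typing import Callable, Dict, List


def stock_no_cancel(x: List[Dict[int, float]], sizes: List[int],
                    rewards: List[float], B: int,
                    uniform: Callable[[], float] = random.random) -> float:
    """One run of the rounding; x[i][t] plays the role of x*_{i,t}."""
    # Step 1: D_i = t with probability x*_{i,t} / 4, otherwise D_i = infinity.
    start = []
    for xi in x:
        u, acc, d = uniform(), 0.0, math.inf
        for t, xt in sorted(xi.items()):
            acc += xt / 4.0
            if u < acc:
                d = t
                break
        start.append(d)
    # Steps 2-5: consider the surviving items in increasing order of D_i.
    order = sorted((i for i in range(len(x)) if start[i] != math.inf),
                   key=lambda i: start[i])
    used, total = 0, 0.0
    for i in order:
        if used <= start[i]:        # step 4: at most D_i space is occupied
            used += sizes[i]        # step 5: the item runs to completion
            if used <= B:           # reward only if it fits in the knapsack
                total += rewards[i]
    return total
```

Averaging `stock_no_cancel` over many runs, with sizes and rewards resampled from the item distributions each time, gives an empirical estimate of the policy's expected reward.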

Notice that the strategy obtained by the rounding procedure obtains reward from all items which are not dropped and which do not fail (i.e., they can start being scheduled no later than the start-time $D_i$ sampled in Step 1); we now bound the failure probability.

Lemma 2.2 For every $i$, $\Pr(i \text{ fails} \mid D_i = t) \le 1/2$.

Proof. Consider an item $i$ and time $t \ne \infty$ and condition on the event that $D_i = t$. Let us consider the execution of the algorithm when it tries to add item $i$ to the knapsack in steps 3-5. Now, let $Z$ be a random variable denoting how much of the interval $[0, t]$ of the knapsack is occupied by previously scheduled items, at the time when $i$ is considered for addition; since $i$ does not fail when $Z < t$, it suffices to prove that $\Pr(Z \ge t) \le 1/2$.

For some item $j \ne i$, let $\mathbf{1}_{D_j \le t}$ be the indicator variable that $D_j \le t$; notice that by the order in which algorithm StocK-NoCancel adds items into the knapsack, it is also the indicator that $j$ was considered before $i$. In addition, let $\mathbf{1}^{\mathrm{size}}_{j,s}$ be the indicator variable that $S_j = s$. Now, if $Z_j$ denotes the total amount of the interval $[0, t]$ that $j$ occupies, we have

\[ Z_j \le \mathbf{1}_{D_j \le t} \sum_{s} \mathbf{1}^{\mathrm{size}}_{j,s} \min(s, t). \]

Now, using the independence of $\mathbf{1}_{D_j \le t}$ and $\mathbf{1}^{\mathrm{size}}_{j,s}$, we have

\[ \mathbb{E}[Z_j] \le \mathbb{E}[\mathbf{1}_{D_j \le t}] \cdot \mathbb{E}[\min(S_j, t)] = \frac{1}{4} \sum_{t' \le t} x^*_{j,t'} \cdot \mathbb{E}[\min(S_j, t)]. \qquad (2.4) \]

Since $Z = \sum_j Z_j$, we can use linearity of expectation and the fact that $\{x^*\}$ satisfies LP constraint (2.2) to get

\[ \mathbb{E}[Z] \le \frac{1}{4} \sum_{j} \sum_{t' \le t} x^*_{j,t'} \cdot \mathbb{E}[\min(S_j, t)] \le \frac{t}{2}. \]

To conclude the proof of the lemma, we apply Markov's inequality to obtain $\Pr(Z \ge t) \le 1/2$. ■
To complete the analysis, we use the fact that any item chooses a random start time $D_i = t$ with probability $x^*_{i,t}/4$, and conditioned on this event, it is added to the knapsack with probability at least 1/2 by Lemma 2.2; in this case, we get an expected reward of at least $\mathsf{ER}_{i,t}$. The theorem below (formally proved in Appendix B.1) then follows by linearity of expectation.
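
Written out, the calculation sketched in the preceding paragraph reads as follows (a restatement of the argument above; the formal proof is the one in Appendix B.1):

```latex
\[
\mathbb{E}[\text{reward}]
  \;\ge\; \sum_{i}\sum_{t} \Pr[D_i = t] \cdot \Pr[i \text{ does not fail} \mid D_i = t] \cdot \mathsf{ER}_{i,t}
  \;\ge\; \sum_{i}\sum_{t} \frac{x^*_{i,t}}{4} \cdot \frac{1}{2} \cdot \mathsf{ER}_{i,t}
  \;=\; \frac{1}{8}\,\mathsf{LPOpt}.
\]
```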

Theorem 2.3 The expected reward of our randomized algorithm is at least 1/8 of LPOpt.

3 Stochastic Knapsack with Correlated Rewards and Cancellations 

In this section, we present our algorithm for stochastic knapsack (StocK) where we allow correlations between rewards and sizes, and also allow cancellation of jobs. The example in Appendix A.1 shows that there can be an arbitrarily large gap in the expected profit between strategies that can cancel jobs and those that can't. Hence we need to write new LPs to capture the benefit of cancellation, which we do in the following manner.

Consider any job j: we can create two jobs from it, the "early" version of the job, where we discard profits from any instantiation where the size of the job is more than B/2, and the "late" version of the job where we discard profits from instantiations of size at most B/2. Hence, we can get at least half the optimal value by flipping a fair coin and collecting rewards from either the early or the late versions of jobs, based on the outcome. In the next section, we show how to obtain a constant factor approximation for the first kind. For the second kind, we argue that cancellations don't help; we can then reduce it to StocK without cancellations (considered in Section 2).
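
The early/late decomposition can be sketched in a few lines of Python. The item representation and the name `split_early_late` are assumptions for illustration; only the rewards change, the size distribution stays the same in both copies.

```python
from typing import Dict, Tuple

Item = Dict[int, Tuple[float, float]]   # assumed: size -> (probability, reward)


def split_early_late(item: Item, B: int) -> Tuple[Item, Item]:
    """Return the "early" and "late" versions of a job for budget B."""
    # Early version: discard rewards of instantiations with size > B/2.
    early = {s: (p, r if 2 * s <= B else 0.0) for s, (p, r) in item.items()}
    # Late version: discard rewards of instantiations with size <= B/2.
    late = {s: (p, r if 2 * s > B else 0.0) for s, (p, r) in item.items()}
    return early, late
```

For every size, the early and late rewards add up to the original reward, which is why the better of the two restricted instances retains at least half of the optimal value.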
3.1 Case I: Jobs with Early Rewards 

We begin with the setting in which only small-size instantiations of items may fetch reward, i.e., the rewards $R_{i,t}$ of every item $i$ are assumed to be 0 for $t > B/2$. In the following LP relaxation $\mathsf{LP}_S$, $v_{i,t} \in [0, 1]$ tries to capture the probability with which Opt will process item $i$ for at least $t$ timesteps¹, and $s_{i,t} \in [0, 1]$ is the probability that Opt stops processing item $i$ exactly at $t$ timesteps. The time-indexed formulation causes the algorithm to have running time poly(B); however, it is easy to write compact (approximate) LPs and then round them; we describe the necessary changes to obtain an algorithm with running time poly(n, log B) in Appendix C.2.

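\[
\begin{aligned}
\max\ & \textstyle\sum_{1 \le t \le B/2} \sum_{1 \le i \le n} v_{i,t} \cdot R_{i,t} \cdot \frac{\pi_{i,t}}{\sum_{t' \ge t} \pi_{i,t'}} && && (\mathsf{LP}_S) \\
& v_{i,t} = s_{i,t} + v_{i,t+1} && \forall t \in [0, B],\ i \in [n] && (3.5) \\
& s_{i,t} \ge \frac{\pi_{i,t}}{\sum_{t' \ge t} \pi_{i,t'}} \cdot v_{i,t} && \forall t \in [0, B],\ i \in [n] && (3.6) \\
& \textstyle\sum_{i \in [n]} \sum_{t \in [0, B]} t \cdot s_{i,t} \le B && && (3.7) \\
& v_{i,0} = 1 && \forall i && (3.8) \\
& v_{i,t},\ s_{i,t} \in [0, 1] && \forall t \in [0, B],\ i \in [n] && (3.9)
\end{aligned}
\]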



Theorem 3.1 The linear program ($\mathsf{LP}_S$) is a valid relaxation for the StocK problem, and hence the optimal value LPOpt of the LP is at least the total expected reward Opt of an optimal solution.

Proof. Consider an optimal solution Opt and let $v^*_{i,t}$ and $s^*_{i,t}$ denote the probability that Opt processes item $i$ for at least $t$ timesteps, and the probability that Opt stops processing item $i$ at exactly $t$ timesteps. We will now show that all the constraints of $\mathsf{LP}_S$ are satisfied one by one.

To this end, let $R_i$ denote the random variable (over different executions of Opt) for the amount of processing done on job $i$. Notice that $\Pr[R_i \ge t] = \Pr[R_i \ge (t + 1)] + \Pr[R_i = t]$. But now, by definition we have $\Pr[R_i \ge t] = v^*_{i,t}$ and $\Pr[R_i = t] = s^*_{i,t}$. This shows that $\{v^*, s^*\}$ satisfies constraints (3.5).

For the next constraint, observe that conditioned on Opt running an item $i$ for at least $t$ timesteps, the probability of item $i$ stopping due to its size having instantiated to exactly $t$ is $\pi_{i,t} / \sum_{t' \ge t} \pi_{i,t'}$, i.e., $\Pr[R_i = t \mid R_i \ge t] \ge \pi_{i,t} / \sum_{t' \ge t} \pi_{i,t'}$. This shows that $\{v^*, s^*\}$ satisfies constraints (3.6).

Finally, to see why constraint (3.7) is satisfied, consider any particular run of the optimal algorithm and let $\mathbf{1}^{\mathrm{stop}}_{i,t}$ denote the indicator random variable of the event $R_i = t$. Then we have

\[ \sum_{i} \sum_{t} \mathbf{1}^{\mathrm{stop}}_{i,t} \cdot t \le B. \]

Now, taking expectation over all runs of Opt and using linearity of expectation and the fact that $\mathbb{E}[\mathbf{1}^{\mathrm{stop}}_{i,t}] = s^*_{i,t}$, we get constraint (3.7). As for the objective function, we again consider a particular run of the optimal algorithm and let $\mathbf{1}^{\mathrm{proc}}_{i,t}$ now denote the indicator random variable for the event $(R_i \ge t)$, and $\mathbf{1}^{\mathrm{size}}_{i,t}$ denote the indicator variable for whether the size of item $i$ is instantiated to exactly $t$ in this run. Then we have the total reward collected by Opt in this run to be exactly

\[ \sum_{i} \sum_{t} \mathbf{1}^{\mathrm{proc}}_{i,t} \cdot \mathbf{1}^{\mathrm{size}}_{i,t} \cdot R_{i,t}. \]

Now, we simply take the expectation of the above random variable over all runs of Opt, and then use the following fact about $\mathbb{E}[\mathbf{1}^{\mathrm{proc}}_{i,t} \mathbf{1}^{\mathrm{size}}_{i,t}]$:

\[
\begin{aligned}
\mathbb{E}\big[\mathbf{1}^{\mathrm{proc}}_{i,t} \mathbf{1}^{\mathrm{size}}_{i,t}\big]
&= \Pr\big[\mathbf{1}^{\mathrm{proc}}_{i,t} = 1 \wedge \mathbf{1}^{\mathrm{size}}_{i,t} = 1\big] \\
&= \Pr\big[\mathbf{1}^{\mathrm{proc}}_{i,t} = 1\big] \cdot \Pr\big[\mathbf{1}^{\mathrm{size}}_{i,t} = 1 \mid \mathbf{1}^{\mathrm{proc}}_{i,t} = 1\big] \\
&= v^*_{i,t} \cdot \frac{\pi_{i,t}}{\sum_{t' \ge t} \pi_{i,t'}}.
\end{aligned}
\]

We thus get that the expected reward collected by Opt is exactly equal to the objective function value of the LP formulation for the solution $(v^*, s^*)$. ■

¹ In the following two sections, we use the word timestep to refer to processing one unit of some item.
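
As a sanity check on the meaning of $v$ and $s$, the following hedged Python sketch builds the pair induced by the simplest policy for a single item (start it and never cancel) and verifies constraints (3.5) and (3.6) numerically. The function names and the list layout with one extra trailing entry are assumptions for this example.

```python
from typing import Dict, List, Tuple


def run_to_completion(pi: Dict[int, float], B: int) -> Tuple[List[float], List[float]]:
    """(v, s) for a policy that starts the item and never cancels it.

    pi[t] is the probability that the size equals t (sizes lie in 1..B).
    Both lists are indexed by t = 0..B+1 so that v[t + 1] is always defined.
    """
    s = [pi.get(t, 0.0) for t in range(B + 2)]   # stop at t iff the size is t
    v = [0.0] * (B + 2)
    for t in range(B, -1, -1):
        v[t] = s[t] + v[t + 1]                   # constraint (3.5)
    return v, s


def satisfies_3_5_and_3_6(pi: Dict[int, float], v: List[float], s: List[float],
                          B: int, tol: float = 1e-9) -> bool:
    for t in range(B + 1):
        tail = sum(p for tp, p in pi.items() if tp >= t)
        if abs(v[t] - (s[t] + v[t + 1])) > tol:                       # (3.5)
            return False
        if tail > 0 and s[t] < pi.get(t, 0.0) / tail * v[t] - tol:    # (3.6)
            return False
    return True
```

For this policy, constraint (3.6) holds with equality wherever the tail probability is positive; a policy that cancels makes $s_{i,t}$ strictly larger than the right-hand side at the cancellation times.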

Our rounding algorithm is very natural, and simply tries to mimic the probability distribution (over when to stop each item) as suggested by the optimal LP solution. To this end, let $(v^*, s^*)$ denote an optimal fractional solution. The reason why we introduce some damping (in the selection probabilities) up-front is to make sure that we can appeal to Markov's inequality and ensure that the knapsack is not violated with good probability.

Algorithm 3.1 Algorithm StocK-Small

1: for each item i do
2:   ignore i with probability 1 − 1/4 (i.e., do not schedule it at all).
3:   for 0 ≤ t ≤ B/2 do
4:     cancel item i at this step with probability $s^*_{i,t}/v^*_{i,t} - \pi_{i,t}/\sum_{t' \ge t} \pi_{i,t'}$ and continue to next item.
5:     process item i for its (t + 1)st timestep.
6:     if item i terminates after being processed for exactly (t + 1) timesteps then
7:       collect a reward of $R_{i,t+1}$ from this item; continue onto next item;



Notice that while we let the algorithm proceed even if its budget is violated, we will collect reward only from items that complete before time B. This simplifies the analysis a fair bit, both here and for the MAB algorithm. In Lemma 3.2 below (proof in Appendix C), we show that for any item that is not dropped in step 2, its probability distribution over stopping times is identical to the optimal LP solution $s^*$. We then use this to argue that the expected reward of our algorithm is $\Omega(1) \cdot \mathsf{LPOpt}$.
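
A literal, hedged Python transcription of one run of StocK-Small is given below. It is a sketch under assumptions rather than the authors' implementation: the realized `sizes` are supplied only to simulate the run, `keep_prob` and `uniform` are hypothetical parameters added so the randomness can be controlled, and an item whose LP value $v^*_{i,t}$ is zero is simply cancelled.

```python
import random
from typing import Callable, Dict, List


def stock_small(pi: List[Dict[int, float]], R: List[Dict[int, float]],
                v: List[List[float]], s: List[List[float]],
                sizes: List[int], B: int, keep_prob: float = 0.25,
                uniform: Callable[[], float] = random.random) -> float:
    """pi[i], R[i]: size distribution and rewards of item i, keyed by size.
    v[i][t], s[i][t]: LP solution; sizes[i]: realized size of item i."""
    elapsed, total = 0, 0.0
    for i in range(len(sizes)):
        if uniform() >= keep_prob:               # step 2: drop w.p. 1 - 1/4
            continue
        for t in range(B // 2 + 1):              # step 3: 0 <= t <= B/2
            tail = sum(p for tp, p in pi[i].items() if tp >= t)
            hazard = pi[i].get(t, 0.0) / tail if tail > 0 else 0.0
            # Step 4: cancellation probability s*/v* minus the completion hazard.
            cancel_p = s[i][t] / v[i][t] - hazard if v[i][t] > 0 else 1.0
            if uniform() < cancel_p:
                break
            elapsed += 1                         # step 5: one more timestep
            if sizes[i] == t + 1:                # step 6: the item completes
                if elapsed <= B:                 # reward only within the budget
                    total += R[i].get(t + 1, 0.0)
                break
    return total
```

As in the text, the run is allowed to continue past the budget, but items that complete after time B contribute nothing.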



Lemma 3.2 Consider item $i$ that was not dropped in step 2. Then, for any timestep $t \ge 0$, the following hold:

(i) The probability (including cancellation and completion) of stopping at timestep $t$ for item $i$ is $s^*_{i,t}$.

(ii) The probability that item $i$ gets processed for its $(t + 1)$st timestep is exactly $v^*_{i,t+1}$.

(iii) If item $i$ has been processed for $(t + 1)$ timesteps, the probability of completing successfully at timestep $(t + 1)$ is $\pi_{i,t+1} / \sum_{t' \ge t+1} \pi_{i,t'}$.

Theorem 3.3 The expected reward of our randomized algorithm is at least 1/8 of LPOpt.

Proof. Consider any item i. In the worst case, we process it after all other items. Then the total expected size 
occupied thus far is at most Yli'^i l^, eep X]t>o * ' s i' v wri ere \ k ^ ep i s the indicator random variable denoting 
whether item i' is not dropped in |step 2\ Here we have used Lemma 3.2 to argue that if an item i' is selected, 



its stopping-time distribution follows s*, t . Taking expectation over the randomness in |step 2 , the expected space 



occupied by other jobs is at most 2~2i'^i 3 2~2t>o^ ' s *i> t — T - Markov's inequality implies that this is at most 
B/2 with probability at least 1/2. In this case, if item i is started (which happens w.p. 1/4), it runs without 
violating the knapsack, with expected reward J2t>i v tt ' 7r i ! */(X)t'>t 7r i,*')> ^ e tota l expected reward is then at 
least Ei I Et <^,*/&>t > ^ ■ 
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3.2 Case II: Jobs with Late Rewards 



Now we handle instances in which only large-size instantiations of items may fetch reward, i.e., the rewards
$R_{i,t}$ of every item i are assumed to be 0 for $t \le B/2$. For such instances, we now argue that cancellation is
not helpful. As a consequence, we can use the results of Section 2 and obtain a constant-factor approximation
algorithm.

To see why, intuitively, as an algorithm processes a job for its t-th timestep for $t < B/2$, it gets no more information
about the reward than when starting (since all rewards are at large sizes). Furthermore, there is no benefit of
canceling a job once it has run for at least B/2 timesteps, since we cannot get any reward by starting some other item.

More formally, consider a (deterministic) strategy S which in some state makes the decision of scheduling item i
and halting its execution if it takes more than t timesteps. First suppose that $t \le B/2$; since this job will not
be able to reach size larger than B/2, no reward will be accrued from it and hence we can change this strategy by
skipping the scheduling of i without altering its total reward. Now consider the case where $t > B/2$. Consider
the strategy S' which behaves as S except that it does not preempt i in this state but lets i run to completion.
We claim that S' obtains at least as much expected reward as S. First, whenever item i has size at most t,
S and S' obtain the same reward. Now suppose that we are in a scenario where i reached size $t > B/2$. Then
item i is halted and S cannot obtain any other reward in the future, since no item that can fetch any reward would
complete before the budget runs out; in the same situation, strategy S' obtains non-negative rewards. Using this
argument we can eliminate all the cancellations of a strategy without decreasing its expected reward.

Lemma 3.4 There is an optimal solution in this case which does not cancel. 



As mentioned earlier, we can now appeal to the results of Section 2 and obtain a constant-factor approximation for
the large-size instances. Now we can combine the algorithms that handle the two different scenarios (or choose
one at random and run it), and get a constant fraction of the expected reward that an optimal policy fetches.

4 Multi-Armed Bandits 

We now turn our attention to the more general Multi-Armed Bandits problem (MAB). In this framework, there
are n arms: arm i has a collection of states denoted by $S_i$ and a starting state $\rho_i \in S_i$. Without loss of generality, we
assume that $S_i \cap S_j = \emptyset$ for $i \ne j$. Each arm also has a transition graph $T_i$, which is given as a polynomial-size
(weighted) directed tree rooted at $\rho_i$; we will relax the tree assumption later. If there is an edge $u \to v$ in $T_i$, then
the edge weight $p_{u,v}$ denotes the probability of making a transition from u to v if we play arm i when its current
state is node u; hence $\sum_{v : (u,v) \in T_i} p_{u,v} = 1$. Each time we play an arm, we get a reward whose value depends on
the state from which the arm is played. Let us denote the reward at a state u by $r_u$. Recall that the martingale
property on rewards requires that $\sum_{v : (u,v) \in T_i} p_{u,v} \, r_v = r_u$ for all states u.

Problem Definition. For a concrete example, we consider the following budgeted learning problem on tree
transition graphs. Each of the arms starts at the start state $\rho_i \in S_i$. We get a reward from each of the states we
play, and the goal is to maximize the total expected reward, while not exceeding a pre-specified allowed number
of plays B across all arms. The framework described below can handle other problems (like the explore/exploit
kind) as well, and we discuss this in Appendix F.



Note that the Stochastic Knapsack problem considered in the previous section is a special case of this problem
where each item corresponds to an arm, and the evolution of the states corresponds to the explored size for the
item. Rewards are associated with each stopping size, which can be modeled by end states that can be reached
from the states of the corresponding size with the probability of this transition being the probability of the item
taking this size. Thus the resulting trees are paths of length up to the maximum size B with transitions to end
states with reward for each item size. For example, the transition graph in Figure 4.1 corresponds to an item
which instantiates to a size of 1 with probability 1/2 (and fetches a reward $R_1$), takes size 3 with probability
1/4 (with reward $R_3$), and size 4 with the remaining probability 1/4 (reward is $R_4$). Notice that the reward on
stopping at all intermediate nodes is 0 and such an instance therefore does not satisfy the martingale property.
Even though the rewards are obtained in this example on reaching a state rather than playing it, it is not hard to
modify our methods for this version as well.
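The reduction just described can be sketched in a few lines. The sketch below is only illustrative: the state names `('run', t)` and `('end', t)` and both function names are hypothetical, rewards are attached to the end states (as in the example), and exact fractions are used so that the martingale test is an exact comparison.

```python
from fractions import Fraction


def knapsack_item_to_arm(size_prob, size_reward):
    """Build the path-with-end-states transition tree for one knapsack item.

    size_prob[t] is the probability that the item has size t and
    size_reward[t] is the reward for size t.  State ('run', t) means
    "processed for t timesteps so far"; ('end', t) means "completed with size t".
    Returns (children, reward) with children[u] = list of (v, p_uv).
    """
    max_size = max(size_prob)
    children, reward = {}, {}
    for t in range(max_size):
        u = ('run', t)
        reward[u] = Fraction(0)
        tail = sum(p for s, p in size_prob.items() if s > t)   # Pr[size > t]
        stop = size_prob.get(t + 1, Fraction(0)) / tail        # Pr[size = t+1 | size > t]
        children[u] = []
        if stop > 0:
            end = ('end', t + 1)
            children[u].append((end, stop))
            children[end] = []
            reward[end] = size_reward[t + 1]
        if stop < 1:
            children[u].append((('run', t + 1), 1 - stop))
    return children, reward


def violates_martingale(children, reward):
    """Non-leaf states u where sum_v p_uv * r_v differs from r_u."""
    return [u for u, kids in children.items()
            if kids and sum(p * reward[v] for v, p in kids) != reward[u]]
```

For the item of Figure 4.1 with all three rewards set to 1 (an arbitrary choice for illustration), the check flags the states from which an end state is reachable, matching the remark that such instances are not martingales.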




Figure 4.1: Reducing Stochastic Knapsack to MAB 

Notation. The transition graph $T_i$ for arm i is an out-arborescence defined on the states $S_i$ rooted at $\rho_i$. Let
depth(u) of a node $u \in S_i$ be the depth of node u in tree $T_i$, where the root $\rho_i$ has depth 0. The unique parent
of node u in $T_i$ is denoted by parent(u). Let $S = \cup_i S_i$ denote the set of all states in the instance, and arm(u)
denote the arm to which state u belongs, i.e., the index i such that $u \in S_i$. Finally, for $u \in S_i$, we refer to the act
of playing arm i when it is in state u as "playing state $u \in S_i$", or "playing state u" if the arm is clear in context.

4.1 Global Time-indexed LP 

In the following, the variable $z_{u,t} \in [0, 1]$ indicates that the algorithm plays state $u \in S_i$ at time t. For state
$u \in S_i$ and time t, $w_{u,t} \in [0, 1]$ indicates that arm i first enters state u at time t: this happens if and only if the
algorithm played parent(u) at time t - 1 and the arm made a transition into state u.



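max $\sum_{u,t} r_u \cdot z_{u,t}$    ($\mathrm{LP}_{\mathrm{mab}}$)

$w_{u,t} = z_{\mathrm{parent}(u),t-1} \cdot p_{\mathrm{parent}(u),u}$    $\forall t \in [2,B],\ u \in S \setminus \cup_i \{\rho_i\}$    (4.10)

$\sum_{t' \le t} w_{u,t'} \ge \sum_{t' \le t} z_{u,t'}$    $\forall t \in [1,B],\ u \in S$    (4.11)

$\sum_{u \in S} z_{u,t} \le 1$    $\forall t \in [1,B]$    (4.12)

$w_{\rho_i,1} = 1$    $\forall i \in [1,n]$    (4.13)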
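To make the four constraint families concrete, here is a small feasibility checker. It is a sketch under assumed conventions: the function name and the dictionary-based encoding (`parent`, `p`, `z`, `w`) are hypothetical, missing entries of `z` and `w` are read as 0, and a numerical tolerance `tol` is used for comparisons.

```python
def lp_mab_violations(parent, p, z, w, B, tol=1e-9):
    """Return the constraints among (4.10)-(4.13) that (z, w) violates.

    parent[u] -- parent state of u, or None if u is a root
    p[(v, u)] -- transition probability from v = parent(u) to u
    z[(u, t)], w[(u, t)] -- LP values for t = 1..B (absent keys mean 0)
    """
    def zv(u, t):
        return z.get((u, t), 0.0)

    def wv(u, t):
        return w.get((u, t), 0.0)

    bad = []
    for u, par in parent.items():
        if par is None:                        # (4.13): roots are entered at time 1
            if abs(wv(u, 1) - 1.0) > tol:
                bad.append(('4.13', u))
        else:                                  # (4.10): u is entered via a play of parent(u)
            for t in range(2, B + 1):
                if abs(wv(u, t) - zv(par, t - 1) * p[(par, u)]) > tol:
                    bad.append(('4.10', u, t))
        played = entered = 0.0
        for t in range(1, B + 1):              # (4.11): plays so far <= entries so far
            played += zv(u, t)
            entered += wv(u, t)
            if played > entered + tol:
                bad.append(('4.11', u, t))
    for t in range(1, B + 1):                  # (4.12): at most one play per timestep
        if sum(zv(u, t) for u in parent) > 1.0 + tol:
            bad.append(('4.12', t))
    return bad
```

For a single arm whose root branches to two children with probability 1/2 each, the strategy "play the root, then play whichever child is reached" yields a feasible point, while inflating one child's play value breaks (4.11) and (4.12).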

Lemma 4.1 The value of an optimal LP solution LPOpt is at least Opt, the expected reward of an optimal
adaptive strategy.

Proof. We adopt the convention that Opt starts playing at time 1. Let $z^*_{u,t}$ denote the probability that Opt plays state u at
time t, namely, the probability that arm arm(u) is in state u at time t and is played at time t. Also let $w^*_{u,t}$ denote
the probability that Opt "enters" state u at time t, and further let $w^*_{\rho_i,1} = 1$ for all i.



We first show that $\{z^*, w^*\}$ is a feasible solution for $\mathrm{LP}_{\mathrm{mab}}$ and later argue that its LP objective is at least Opt.
Consider constraint (4.10) for some $t \in [2,B]$ and $u \in S$. The probability of entering state u at time t conditioned
on Opt playing state parent(u) at time t - 1 is $p_{\mathrm{parent}(u),u}$. In addition, the probability of entering state u at time
t conditioned on Opt not playing state parent(u) at time t - 1 is zero. Since $z^*_{\mathrm{parent}(u),t-1}$ is the probability that
Opt plays state parent(u) at time t - 1, we remove the conditioning to obtain $w^*_{u,t} = z^*_{\mathrm{parent}(u),t-1} \cdot p_{\mathrm{parent}(u),u}$.

Now consider constraint (4.11) for some $t \in [1,B]$ and $u \in S$. For any outcome of the algorithm (denoted by a
sample path $\sigma$), let $\mathbf{1}^{\mathrm{enter}}_{u',t'}$ be the indicator variable that Opt enters state u' at time t' and let $\mathbf{1}^{\mathrm{play}}_{u',t'}$ be the indicator
variable that Opt plays state u' at time t'. Since $T_i$ is acyclic, state u is played at most once in $\sigma$ and is also
entered at most once in $\sigma$. Moreover, whenever u is played before or at time t, it must be that u was also entered
before or at time t, and hence $\sum_{t' \le t} \mathbf{1}^{\mathrm{play}}_{u,t'} \le \sum_{t' \le t} \mathbf{1}^{\mathrm{enter}}_{u,t'}$. Taking expectation on both sides and using the fact
that $\mathbb{E}[\mathbf{1}^{\mathrm{play}}_{u,t'}] = z^*_{u,t'}$ and $\mathbb{E}[\mathbf{1}^{\mathrm{enter}}_{u,t'}] = w^*_{u,t'}$, linearity of expectation gives $\sum_{t' \le t} z^*_{u,t'} \le \sum_{t' \le t} w^*_{u,t'}$.

To see that constraints (4.12) are satisfied, notice that we can play at most one arm (or alternatively one state) in
each time step, hence $\sum_{u \in S} \mathbf{1}^{\mathrm{play}}_{u,t} \le 1$ holds for all $t \in [1,B]$; the claim then follows by taking expectation on
both sides as in the previous paragraph. Finally, constraint (4.13) is satisfied by definition of the start states.

To conclude the proof of the lemma, it suffices to show that $\mathrm{Opt} = \sum_{u,t} r_u \cdot z^*_{u,t}$. Since Opt obtains reward $r_u$
whenever it plays state u, it follows that Opt's reward is given by $\sum_{u,t} r_u \cdot \mathbf{1}^{\mathrm{play}}_{u,t}$; by taking expectation we get
$\sum_{u,t} r_u \, z^*_{u,t} = \mathrm{Opt}$, and hence $\mathrm{LPOpt} \ge \mathrm{Opt}$. ■

4.2 The Rounding Algorithm 

In order to best understand the motivation behind our rounding algorithm, it would be useful to go over the
example in Appendix A.3, which illustrates the necessity of preemption (repeatedly switching back and forth
between the different arms).



At a high level, the rounding algorithm proceeds as follows. In Phase I, given an optimal LP solution, we
decompose the fractional solution for each arm into a convex² combination of integral "strategy forests" (which
are depicted in Figure 4.2): each of these tells us at what times to play the arm, and in which states to abandon the
arm. Now, if we sample a random strategy forest for each arm from this distribution, we may end up scheduling
multiple arms to play at some of the timesteps, and hence we need to resolve these conflicts. A natural first
approach might be to (i) sample a strategy forest for each arm, (ii) play these arms in a random order, and (iii) for
any arm follow the decisions (about whether to abort or continue playing) as suggested by the sampled strategy
forest. In essence, we are ignoring the times at which the sampled strategy forest has scheduled the plays of this
arm and instead playing this arm continually until the sampled forest abandons it. While such a non-preemptive
strategy works when the martingale property holds, the example in Appendix A.3 shows that preemption is
unavoidable.

Another approach would be to try to play the sampled forests at their prescribed times; if multiple forests want 
to play at the same time slot, we round-robin over them. The expected number of plays in each timestep is 1, 
and the hope is that round-robin will not hurt us much. However, if some arm needs B contiguous steps to get to 
a state with high reward, and a single play of some other arm gets scheduled by bad luck in some timestep, we 
would end up getting nothing! 

Guided by these bad examples, we try to use the continuity information in the sampled strategy forests: once
we start playing some contiguous component (where the strategy forest plays the arm in every consecutive time
step), we play it to the end of the component. The naive implementation does not work, so we first alter the LP
solution to get convex combinations of "nice" forests; loosely, these are forests where the strategy forest plays
contiguously in almost all timesteps, or in at least half the timesteps. This alteration is done in Phase II, and then
the actual rounding in Phase III, and the analysis appears in Section 4.2.3.

4.2.1 Phase I: Convex Decomposition 

In this step, we decompose the fractional solution into a convex combination of "forest-like strategies" $\{T(i,j)\}_{i,j}$,
corresponding to the j-th forest for arm i. We first formally define what these forests look like: The j-th strategy
forest T(i, j) for arm i is an assignment of values time(i, j, u) and prob(i, j, u) to each state $u \in S_i$ such that:

(i) For $u \in S_i$ and v = parent(u), it holds that $\mathrm{time}(i,j,u) \ge 1 + \mathrm{time}(i,j,v)$, and

(ii) For $u \in S_i$ and v = parent(u), if $\mathrm{time}(i,j,u) \ne \infty$ then $\mathrm{prob}(i,j,u) = p_{v,u} \cdot \mathrm{prob}(i,j,v)$; else if
$\mathrm{time}(i,j,u) = \infty$ then $\mathrm{prob}(i,j,u) = 0$.

We call a triple (i, j, u) a tree-node of T(i, j). When i and j are understood from the context, we identify the 
tree-node (i, j, u) with the state u. 

For any state u, the values time(i, j, u) and prob(i, j, u) denote the time at which the arm i is played at state u, and
the probability with which the arm is played, according to the strategy forest T(i, j). The probability values are
particularly simple: if $\mathrm{time}(i,j,u) = \infty$ then this strategy does not play the arm at u, and hence the probability
is zero, else prob(i, j, u) is equal to the probability of reaching u over the random transitions according to $T_i$ if
we play the root with probability $\mathrm{prob}(i,j,\rho_i)$. Hence, we can compute prob(i, j, u) just given $\mathrm{prob}(i,j,\rho_i)$ and
whether or not $\mathrm{time}(i,j,u) = \infty$. Note that the time values are not necessarily consecutive; plotting these on the
timeline and connecting a state to its parents only when they are in consecutive timesteps (as in Figure 4.2) gives
us forests, hence the name.

2 Strictly speaking, we do not get convex combinations that sum to one; our combinations sum to $\sum_t z_{\rho_i,t}$, the value the LP assigned
to pick to play the root of the arm over all possible start times, which is at most one.

3 When i and j are clear from the context, we will just refer to state u instead of the triple (i, j, u).




[Figure 4.2 panels: (a) Strategy forest: numbers are times; (b) Strategy forest shown on a timeline]

Figure 4.2: Strategy forests and how to visualize them: grey blobs are connected components. 

The algorithm to construct such a decomposition proceeds in rounds for each arm i; in a particular round, it
"peels" off such a strategy as described above, and ensures that the residual fractional solution continues to
satisfy the LP constraints, guaranteeing that we can repeat this process, which is similar to (but slightly more
involved than) performing flow-decompositions. The decomposition lemma is proved in Appendix D.1.



Lemma 4.2 Given a solution to ($\mathrm{LP}_{\mathrm{mab}}$), there exists a collection of at most $nB|S|$ strategy forests $\{T(i,j)\}$
such that $z_{u,t} = \sum_{j : \mathrm{time}(i,j,u) = t} \mathrm{prob}(i,j,u)$.⁴ Hence, $\sum_{(i,j,u) : \mathrm{time}(i,j,u) = t} \mathrm{prob}(i,j,u) \le 1$ for all t.

For any T(i, j), these prob values satisfy a "preflow" condition: the in-flow at any node v is always at least
the out-flow, namely $\mathrm{prob}(i,j,v) \ge \sum_{u : \mathrm{parent}(u) = v} \mathrm{prob}(i,j,u)$. This leads to the following simple but crucial
observation.

Observation 4.3 For any arm i, for any set of states $X \subseteq S_i$ such that no state in X is an ancestor of another
state in X in the transition tree $T_i$, and for any $z \in S_i$ that is an ancestor of all states in X, $\mathrm{prob}(i,j,z) \ge
\sum_{x \in X} \mathrm{prob}(i,j,x)$.

More generally, given similar conditions on X, if Z is a set of states such that for any $x \in X$, there exists $z \in Z$
such that z is an ancestor of x, we have $\sum_{z \in Z} \mathrm{prob}(i,j,z) \ge \sum_{x \in X} \mathrm{prob}(i,j,x)$.

4.2.2 Phase II: Eliminating Small Gaps 



While Appendix A.3 shows that preemption is necessary to remain competitive with respect to Opt, we also
should not get "tricked" into switching arms during very short breaks taken by the LP. For example, say, an arm
of length (B - 1) was played in two continuous segments with a gap in the middle. In this case, we should
not lose out on profit from this arm by starting some other arms' plays during the break. To handle this issue,
whenever some path on the strategy tree is almost contiguous (i.e., gaps on it are relatively small) we make
these portions completely contiguous. Note that we will not make the entire tree contiguous, but just combine
some sections together.



4 To reiterate, even though we call this a convex decomposition, the sum of the probability values of the root state of any arm is at most
one by constraint 4.12, and hence the sum of the probabilities of the root over the decomposition could be less than one in general.

Before we make this formal, here is some useful notation: Given $u \in S_i$, let Head(i, j, u) be its ancestor node
$v \in S_i$ of least depth such that the plays from v through u occur in consecutive time values. More formally,
the path $v = v_1, v_2, \ldots, v_l = u$ in $T_i$ is such that $\mathrm{time}(i,j,v_{l'}) = \mathrm{time}(i,j,v_{l'-1}) + 1$ for all $l' \in [2, l]$. We
also define the connected component of a node u, denoted by comp(i, j, u), as the set of all nodes u' such that
Head(i, j, u) = Head(i, j, u'). Figure 4.2 shows the connected components and heads.
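The definitions of Head and comp translate directly into code for a single strategy forest (i and j fixed). The following is a minimal sketch with hypothetical names: `parent[u]` is the parent state (None for the root) and `time[u]` is time(i, j, u), with `math.inf` standing for states that are never played.

```python
import math


def heads(parent, time):
    """Map every played state u to Head(u): the least-depth ancestor v such
    that all states on the path from v to u are played in consecutive steps."""
    head = {}

    def find(u):
        if u not in head:
            v = parent[u]
            if v is not None and time[u] == time[v] + 1:
                head[u] = find(v)      # u continues its parent's component
            else:
                head[u] = u            # a gap (or the root): u starts a component
        return head[u]

    for u in parent:
        if time[u] != math.inf:
            find(u)
    return head


def components(parent, time):
    """Group played states by their head, giving the connected components."""
    comp = {}
    for u, h in heads(parent, time).items():
        comp.setdefault(h, set()).add(u)
    return comp
```

On a path a, b, c, d played at times 1, 2, 5, 6, this yields the two components {a, b} and {c, d}, headed by a and c respectively.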

The main idea of our gap-filling procedure is the following: if a head state v = Head(i, j, u) is played at time
t = time(i, j, v) s.t. $t < 2 \cdot \mathrm{depth}(v)$, then we "advance" the comp(i, j, v) and get rid of the gap between v and
its parent (and recursively apply this rule).⁵ The procedure can be described in more detail as follows.



Algorithm 4.1 Gap Filling Algorithm GapFill

1: for $\tau$ = B to 1 do
2:   while there exists a tree-node $u \in T(i,j)$ such that $\tau = \mathrm{time}(\mathrm{Head}(u)) < 2 \cdot \mathrm{depth}(\mathrm{Head}(u))$ do
3:     let v = Head(u).
4:     if v is not the root of T(i, j) then
5:       let v' = parent(v).
6:       advance the component comp(v) rooted at v such that $\mathrm{time}(v) \leftarrow \mathrm{time}(v') + 1$, to make comp(v)
         contiguous with the ancestor forming one larger component. Also alter the times of $w \in \mathrm{comp}(v)$
         appropriately to maintain contiguity with v (and now with v').
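A direct, unoptimized rendering of GapFill on one strategy forest might look as follows. This is a sketch under the same hypothetical encoding as before (`parent`, `time`, `math.inf` for unplayed states); it recomputes heads from scratch in every iteration for clarity rather than efficiency, and returns new time labels without modifying its input.

```python
import math


def gap_fill(parent, time, B):
    """Return gap-filled time labels for one strategy forest T(i, j)."""
    time = dict(time)                      # work on a copy
    played = [u for u in parent if time[u] != math.inf]

    def depth(u):
        d = 0
        while parent[u] is not None:
            u, d = parent[u], d + 1
        return d

    def head(u):
        # climb while the parent is played in the immediately preceding step
        while parent[u] is not None and time[u] == time[parent[u]] + 1:
            u = parent[u]
        return u

    for tau in range(B, 0, -1):                                  # line 1
        while True:                                              # line 2
            v = next((h for h in map(head, played)
                      if time[h] == tau and tau < 2 * depth(h)), None)
            if v is None:
                break
            # depth(v) >= 1 here, so v is never the root (line 4 always holds)
            comp = [w for w in played if head(w) == v]           # comp(v)
            shift = time[v] - (time[parent[v]] + 1)              # size of the gap
            for w in comp:                                       # line 6
                time[w] -= shift
    return time
```

Collecting comp(v) before shifting matters: once v moves next to its parent, the head of every state in the component changes, so computing membership on the fly would miss states.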



One crucial property is that these "advances" do not increase by much the number of plays that occur at any given 
time t. Essentially this is because if for some time slot t we "advance" a set of components that were originally 
scheduled after t to now cross time slot t, these components moved because their ancestor paths (fractionally) 
used up at least t/2 of the time slots before t; since there are t time slots to be used up, each to unit extent, there 
can be at most 2 units of components being moved up. Hence, in the following, we assume that our T's satisfy 
the properties in the following lemma: 

Lemma 4.4 Algorithm GapFill produces a modified collection of T's such that

(i) For each i, j, u such that $r_u > 0$, $\mathrm{time}(\mathrm{Head}(i,j,u)) \ge 2 \cdot \mathrm{depth}(\mathrm{Head}(i,j,u))$.

(ii) The total extent of plays at any time t, i.e., $\sum_{(i,j,u) : \mathrm{time}(i,j,u) = t} \mathrm{prob}(i,j,u)$, is at most 3.



The proof appears in Appendix D.2.

4.2.3 Phase III: Scheduling the Arms 

Having done the preprocessing, the rounding algorithm is simple: it first randomly selects at most one strategy
forest from the collection $\{T(i,j)\}_j$ for each arm i. It then picks an arm with the earliest connected component
(i.e., that with smallest time(Head(i, j, u))) that contains the current state (the root states, to begin with), plays
it to the end (which either results in terminating the arm, or making a transition to a state played much later in
time), and repeats. The formal description appears in Algorithm 4.2. (If there are ties in Step 5, we choose the
smallest index.) Note that the algorithm runs as long as there is some active node, regardless of whether or not
we have run out of plays (i.e., the budget is exceeded); however, we only count the profit from the first B plays
in the analysis.

Observe that Steps 7-9 play a connected component of a strategy forest contiguously. In particular, this means
that all currstate(i)'s considered in Step 5 are head vertices of the corresponding strategy forests. These facts
will be crucial in the analysis.



Lemma 4.5 For arm i and strategy T(i, j), conditioned on $\sigma(i) = j$ after Step 1 of AlgMAB, the probability of
playing state $u \in S_i$ is $\mathrm{prob}(i,j,u)/\mathrm{prob}(i,j,\rho_i)$, where the probability is over the random transitions of arm i.



5 The intuition is that such vertices have only a small gap in their play and should rather be played contiguously.

Algorithm 4.2 Scheduling the Connected Components: Algorithm AlgMAB

1: for arm i, sample strategy T(i, j) with probability $\mathrm{prob}(i,j,\rho_i)/24$; ignore arm i w.p. $1 - \sum_j \mathrm{prob}(i,j,\rho_i)/24$.
2: let $A \leftarrow$ set of "active" arms which chose a strategy in the random process.
3: for each $i \in A$, let $\sigma(i) \leftarrow$ index j of the chosen T(i, j) and let currstate(i) $\leftarrow$ root $\rho_i$.
4: while active arms $A \ne \emptyset$ do
5:   let $i^* \leftarrow$ arm with state played earliest in the LP (i.e., $i^* \leftarrow \mathrm{argmin}_{i \in A} \{\mathrm{time}(i, \sigma(i), \mathrm{currstate}(i))\}$).
6:   let $\tau \leftarrow \mathrm{time}(i^*, \sigma(i^*), \mathrm{currstate}(i^*))$.
7:   while $\mathrm{time}(i^*, \sigma(i^*), \mathrm{currstate}(i^*)) \ne \infty$ and $\mathrm{time}(i^*, \sigma(i^*), \mathrm{currstate}(i^*)) = \tau$ do
8:     play arm $i^*$ at state currstate($i^*$)
9:     update currstate($i^*$) to be the new state of arm $i^*$; let $\tau \leftarrow \tau + 1$.
10:  if $\mathrm{time}(i^*, \sigma(i^*), \mathrm{currstate}(i^*)) = \infty$ then
11:    let $A \leftarrow A \setminus \{i^*\}$
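The scheduling loop can be sketched as a simulation that records the order of plays. All names here (`alg_mab`, `pick`, the dictionary keys) are hypothetical; each forest is assumed to be given by its root probability and its (gap-filled) time labels, with unlisted states treated as never played, and the `scale` parameter stands for the damping constant used in Step 1.

```python
import math
import random


def pick(kids, rng):
    """Sample the next state from a list of (child, probability) pairs."""
    x, acc = rng.random(), 0.0
    for v, p in kids:
        acc += p
        if x < acc:
            return v
    return kids[-1][0] if kids else None   # None: the arm has no further state


def alg_mab(arms, rng, scale=24.0):
    """Return the (arm, state) plays in the order the scheduling loop makes them."""
    sigma, curr = {}, {}
    for i, arm in enumerate(arms):                       # steps 1-3
        x, acc = rng.random() * scale, 0.0
        for j, forest in enumerate(arm['forests']):
            acc += forest['root_prob']
            if x < acc:
                sigma[i], curr[i] = j, arm['root']
                break

    def when(i):                                         # time(i, sigma(i), currstate(i))
        if curr[i] is None:
            return math.inf
        return arms[i]['forests'][sigma[i]]['time'].get(curr[i], math.inf)

    active, plays = set(sigma), []
    while active:                                        # step 4
        i = min(active, key=lambda a: (when(a), a))      # step 5, ties: smallest index
        tau = when(i)                                    # step 6
        while when(i) != math.inf and when(i) == tau:    # step 7
            plays.append((i, curr[i]))                   # step 8
            curr[i] = pick(arms[i]['children'].get(curr[i], []), rng)
            tau += 1                                     # step 9
        if when(i) == math.inf:                          # steps 10-11
            active.discard(i)
    return plays
```

With deterministic transitions one can see the intended behaviour: an arm whose component occupies times 1-2 is played first, then another arm's component at times 3-4 runs to its end, and only then does the first arm resume with its state labelled 5.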



The above lemma is relatively simple, and proved in Appendix D.3. The rest of the section proves that in
expectation, we collect a constant factor of the LP reward of each strategy T(i, j) before running out of budget;
the analysis is inspired by our StocK rounding procedure. We mainly focus on the following lemma.

Lemma 4.6 Consider any arm i and strategy T(i, j). Then, conditioned on $\sigma(i) = j$ and on the algorithm
playing state $u \in S_i$, the probability that this play happens before time time(i, j, u) is at least 1/2.

Proof. Fix an arm i and an index j for the rest of the proof. Given a state $u \in S_i$, let $\mathcal{E}_{iju}$ denote the event
$(\sigma(i) = j) \wedge (\text{state } u \text{ is played})$. Also, let v = Head(i, j, u) be the head of the connected component containing
u in T(i, j). Let the r.v. $\tau_u$ (respectively $\tau_v$) be the actual time at which state u (respectively state v) is played; these
random variables take value $\infty$ if the arm is not played in these states. Then

$\Pr[\tau_u \le \mathrm{time}(i,j,u) \mid \mathcal{E}_{iju}] \ge \frac{1}{2} \iff \Pr[\tau_v \le \mathrm{time}(i,j,v) \mid \mathcal{E}_{iju}] \ge \frac{1}{2}$,    (4.14)

because the time between playing u and v is exactly time(i, j, u) - time(i, j, v) since the algorithm plays connected
components continuously (and we have conditioned on $\mathcal{E}_{iju}$). Hence, we can just focus on proving the
right inequality in (4.14) for vertex v.



For brevity of notation, let $t_v = \mathrm{time}(i,j,v)$. In addition, we define the order $\preceq$ to indicate which states
can be played before v. That is, again making use of the fact that the algorithm plays connected components
contiguously, we say that $(i',j',v') \preceq (i,j,v)$ iff $\mathrm{time}(\mathrm{Head}(i',j',v')) \le \mathrm{time}(\mathrm{Head}(i,j,v))$. Notice that this
order is independent of the run of the algorithm.

For each arm $i' \ne i$ and index j', we define random variables $Z_{i',j'}$ used to count the number of plays that can
possibly occur before the algorithm plays state v. If $\mathbf{1}_{(i',j',v')}$ is the indicator variable of event $\mathcal{E}_{i'j'v'}$, define

$Z_{i',j'} = \min\big( t_v, \ \sum_{v' : (i',j',v') \preceq (i,j,v)} \mathbf{1}_{(i',j',v')} \big)$.    (4.15)

We truncate $Z_{i',j'}$ at $t_v$ because we just want to capture how much time up to $t_v$ is being used. Now consider the
sum $Z = \sum_{i' \ne i} \sum_{j'} Z_{i',j'}$. Note that for arm i', at most one of the $Z_{i',j'}$ values will be non-zero in any scenario,
namely the index $\sigma(i')$ sampled in Step 1. The first claim below shows that it suffices to consider the upper tail
of Z, and show that $\Pr[Z \ge t_v/2] \le 1/2$, and the second gives a bound on the conditional expectation of $Z_{i',j'}$.

Claim 4.7 $\Pr[\tau_v \le t_v \mid \mathcal{E}_{iju}] \ge \Pr[Z \le t_v/2]$.

Proof. We first claim that $\Pr[\tau_v \le t_v \mid \mathcal{E}_{iju}] \ge \Pr[Z \le t_v/2 \mid \mathcal{E}_{iju}]$. So, let us condition on $\mathcal{E}_{iju}$. Then if
$Z \le t_v/2$, none of the $Z_{i',j'}$ variables were truncated at $t_v$, and hence Z exactly counts the total number of plays
(by all other arms $i' \ne i$, from any state) that could possibly be played before the algorithm plays v in strategy
T(i, j). Therefore, if Z is at most $t_v/2$, then combining this with the fact that $\mathrm{depth}(v) \le t_v/2$ (from
Lemma 4.4(i)), we can infer that all the plays (including those of v's ancestors) that can be made before playing
v can indeed be completed within $t_v$. In this case the algorithm will definitely play v before $t_v$; hence we get that
conditioning on $\mathcal{E}_{iju}$, the event $\tau_v \le t_v$ holds when $Z \le t_v/2$.

Finally, to remove the conditioning: note that $Z_{i',j'}$ is just a function of (i) the random variables $\mathbf{1}_{(i',j',v')}$, i.e., the
random choices made by playing T(i', j'), and (ii) the constant $t_v = \mathrm{time}(i,j,v)$. However, the r.v.s $\mathbf{1}_{(i',j',v')}$
are clearly independent of the event $\mathcal{E}_{iju}$ for $i' \ne i$ since the plays of AlgMAB in one arm are independent of the
others, and time(i, j, v) is a constant determined once the strategy forests are created in Phase II. Hence the event
$Z \le t_v/2$ is independent of $\mathcal{E}_{iju}$; hence $\Pr[Z \le t_v/2 \mid \mathcal{E}_{iju}] = \Pr[Z \le t_v/2]$, which completes the proof. ■

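Claim 4.8

$\mathbb{E}[Z_{i',j'} \mid \sigma(i') = j'] \le \sum_{v' \text{ s.t. } \mathrm{time}(i',j',v') \le t_v} \frac{\mathrm{prob}(i',j',v')}{\mathrm{prob}(i',j',\rho_{i'})} + t_v \cdot \sum_{v' \text{ s.t. } \mathrm{time}(i',j',v') = t_v} \frac{\mathrm{prob}(i',j',v')}{\mathrm{prob}(i',j',\rho_{i'})}$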



Proof. Recall the definition of $Z_{i',j'}$ in Eq. (4.15): any state v' with $\mathrm{time}(i',j',v') > t_v$ may contribute to the
sum only if it is part of a connected component with head Head(i', j', v') such that $\mathrm{time}(\mathrm{Head}(i',j',v')) \le t_v$,
by the definition of the ordering $\preceq$. Even among such states, if $\mathrm{time}(i',j',v') > 2t_v$, then the truncation implies
that $Z_{i',j'}$ is unchanged whether or not we include $\mathbf{1}_{(i',j',v')}$ in the sum. Indeed, if $\mathbf{1}_{(i',j',v')} = 1$ then all of the
ancestors of v' will have their indicator variables at value 1; moreover $\mathrm{depth}(v') > t_v$ since there is a contiguous
collection of nodes that are played from this tree T(i', j') from time $t_v$ onwards till $\mathrm{time}(i',j',v') > 2t_v$; so the
sum would be truncated at value $t_v$ whenever $\mathbf{1}_{(i',j',v')} = 1$. Therefore, we can write

$Z_{i',j'} \le \sum_{v' : \mathrm{time}(i',j',v') \le t_v} \mathbf{1}_{(i',j',v')} + \sum_{v' : t_v < \mathrm{time}(i',j',v') \le 2t_v} \mathbf{1}_{(i',j',v')}$    (4.16)

Recall we are interested in the conditional expectation given $\sigma(i') = j'$. Note that $\Pr[\mathbf{1}_{(i',j',v')} = 1 \mid \sigma(i') = j'] =
\mathrm{prob}(i',j',v')/\mathrm{prob}(i',j',\rho_{i'})$ by Lemma 4.5, hence the first sum in (4.16) gives the first part of the claimed
bound. Now the second part: observe that for any arm i', any fixed value of $\sigma(i') = j'$, and any value of $t' \ge t_v$,

$\sum_{v' \text{ s.t. } \mathrm{time}(i',j',v') = t'} \mathrm{prob}(i',j',v') \le \sum_{v' \text{ s.t. } \mathrm{time}(i',j',v') = t_v} \mathrm{prob}(i',j',v')$

This is because of the following argument: any state that appears on the LHS of the sum above is part of a
connected component which crosses $t_v$, so it must have an ancestor which is played at $t_v$. Also, since all states
which appear in the LHS are played at t', no state can be an ancestor of another. Hence, we can apply the second
part of Observation 4.3 and get the above inequality. Combining this with the fact that $\Pr[\mathbf{1}_{(i',j',v')} = 1 \mid \sigma(i') =
j'] = \mathrm{prob}(i',j',v')/\mathrm{prob}(i',j',\rho_{i'})$, and applying it for each value of $t' \in (t_v, 2t_v]$, gives us the second term. ■



Equipped with the above claims, we are ready to complete the proof of Lemma 4.6. Employing Claim 4.8 we get

$\mathbb{E}[Z] = \sum_{i' \ne i} \sum_{j'} \mathbb{E}[Z_{i',j'}] = \sum_{i' \ne i} \sum_{j'} \mathbb{E}[Z_{i',j'} \mid \sigma(i') = j'] \cdot \Pr[\sigma(i') = j']$

$\le \frac{1}{24} \sum_{i' \ne i} \sum_{j'} \Big\{ \sum_{v' : \mathrm{time}(i',j',v') \le t_v} \mathrm{prob}(i',j',v') + t_v \sum_{v' : \mathrm{time}(i',j',v') = t_v} \mathrm{prob}(i',j',v') \Big\}$    (4.17)

$\le \frac{1}{24} (3 \cdot t_v + 3 \cdot t_v) \le \frac{1}{4} t_v$.    (4.18)

Equation (4.17) follows from the fact that each tree T(i, j) is sampled with probability $\mathrm{prob}(i,j,\rho_i)/24$ and (4.18)
follows from Lemma 4.4. Applying Markov's inequality, we have that $\Pr[Z \ge t_v/2] \le 1/2$. Finally, Claim 4.7
says that $\Pr[\tau_v \le t_v \mid \mathcal{E}_{iju}] \ge \Pr[Z \le t_v/2] \ge 1/2$, which completes the proof. ■

Theorem 4.9 The reward obtained by the algorithm AlgMAB is at least $\Omega(\mathrm{LPOpt})$.

Proof. The theorem follows by a simple linearity of expectation. Indeed, the expected reward obtained from
any state $u \in S_i$ is at least $\sum_j \Pr[\sigma(i) = j] \cdot \Pr[\text{state } u \text{ is played} \mid \sigma(i) = j] \cdot \Pr[\tau_u \le \mathrm{time}(i,j,u) \mid \mathcal{E}_{iju}] \cdot r_u \ge
\sum_j \frac{\mathrm{prob}(i,j,u)}{24} \cdot \frac{1}{2} \cdot r_u$. Here, we have used Lemmas 4.5 and 4.6 for the second and third probabilities. But now
we can use Lemma 4.2 to infer that $\sum_j \mathrm{prob}(i,j,u) = \sum_t z_{u,t}$. Making this substitution and summing over all
states $u \in S_i$ and arms i completes the proof. ■
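For concreteness, the constant hidden in the $\Omega(\cdot)$ can be spelled out. The chain below is only a worked restatement of the proof above, under the assumption that Step 1 of AlgMAB samples T(i, j) with probability $\mathrm{prob}(i,j,\rho_i)/24$; the resulting factor 1/48 is what these steps give and is not claimed to be tight.

```latex
\begin{align*}
\mathbb{E}[\text{reward of AlgMAB}]
  &\ge \sum_{i} \sum_{u \in S_i} \sum_{j}
       \underbrace{\frac{\mathrm{prob}(i,j,\rho_i)}{24}}_{\Pr[\sigma(i)=j]}
       \cdot
       \underbrace{\frac{\mathrm{prob}(i,j,u)}{\mathrm{prob}(i,j,\rho_i)}}_{\text{Lemma 4.5}}
       \cdot
       \underbrace{\frac{1}{2}}_{\text{Lemma 4.6}}
       \cdot\, r_u \\
  &= \frac{1}{48} \sum_{i} \sum_{u \in S_i} r_u \sum_{j} \mathrm{prob}(i,j,u)
   = \frac{1}{48} \sum_{u,t} r_u \, z_{u,t}
   = \frac{\mathrm{LPOpt}}{48}.
\end{align*}
```

The second line uses Lemma 4.2, which still applies after gap filling because that phase changes only the time labels and not the prob values.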

5 MABs with Arbitrary Transition Graphs 

We now show how we can use techniques akin to those we described for the case when the transition graph is
a tree, to handle the case when it can be an arbitrary directed graph. A naive way to do this is to expand out
the transition graph as a tree, but this incurs an exponential blowup of the state space which we want to avoid.
We can assume we have a layered DAG, though, since the conversion from a digraph to a layered DAG only
increases the state space by a factor of the horizon B; this standard reduction appears in Appendix E.1.
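One common way to carry out such a conversion is to make a copy of each state per number of plays made so far. The sketch below illustrates that idea and is not necessarily the construction of Appendix E.1; the function name and the `(state, layer)` node encoding are assumptions made for illustration, and nodes in the last layer are given no outgoing edges since no plays remain after them.

```python
def layer_digraph(children, root, B):
    """Unroll a transition digraph into a layered DAG with B layers.

    children[u] is a list of (v, p_uv) pairs.  Node (u, d) of the result
    stands for "the arm is in state u after d plays", for d = 0..B-1.
    """
    layered = {}
    frontier = [root]
    for d in range(B):
        last = d == B - 1
        nxt = []
        for u in frontier:
            kids = [] if last else children.get(u, [])
            layered[(u, d)] = [((v, d + 1), p) for v, p in kids]
            for v, _ in kids:
                if v not in nxt:       # keep one copy of each state per layer
                    nxt.append(v)
        frontier = nxt
    return layered
```

Every edge goes from layer d to layer d + 1, so the result is acyclic even when the input has cycles, and it has at most B copies of each original state.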

While we can again write an LP relaxation of the problem for layered DAGs, the challenge arises in the rounding 
algorithm: specifically, in (i) obtaining the convex decomposition of the LP solution as in Phase I, and (ii) 
eliminating small gaps as in Phase II by advancing forests in the strategy. 

• We handle the first difficulty by considering convex decompositions not just over strategy forests, but over
slightly more sophisticated strategy DAGs. Recall (from Figure 4.2) that in the tree case, each state in a
strategy forest was labeled by a unique time and a unique probability associated with that time step. As the
name suggests, we now have labeled DAGs, but the change is more than just that. Now each state has a
copy associated with each time step in {1, . . . , B}. This change tries to capture the fact that our strategy
may play from a particular state u at different times depending on the path taken by the random transitions
used to reach this state. (This path was unique in the tree case.)

Now having sampled a strategy DAG for each arm, one can expand them out into strategy forests (albeit
with an exponential blow-up in the size), and use Phases II and III from our previous algorithm; it is
not difficult to prove that this algorithm is a constant-factor approximation. However, the above is not a
poly-time algorithm, since the size of the strategy forests may be exponentially large. If we don't expand
the DAG, then we do not see how to define gap elimination for Phase II. But we observe that instead of
explicitly performing the advance steps in Phase II, it suffices to perform them as a thought experiment,
i.e., to not alter the strategy forest at all, but merely to infer when these advances would have happened,
and play accordingly in Phase III.⁶ Using this, we can give an algorithm that plays just on the DAG,
and argue that the sequence of plays made by our DAG algorithm faithfully mimics the execution if we
had constructed the exponential-size tree from the DAG, and executed Phases II and III on that tree.



The details of the LP rounding algorithm for layered DAGs follow in Sections 5.1-5.3.



6 This is similar to the idea of lazy evaluation of strategies. The DAG contains an implicit randomized strategy which we make explicit
as we toss coins of the various outcomes using an algorithm.

5.1 LP Relaxation

There is only one change in the LP: constraint (5.19) now says that if a state u is visited at time t, then one of
its ancestors must have been pulled at time t - 1; this ancestor was unique in the case of trees.

max $\sum_{u,t} r_u \cdot z_{u,t}$    ($\mathrm{LP}_{\mathrm{mabdag}}$)

$w_{u,t} = \sum_{v} z_{v,t-1} \cdot p_{v,u}$    $\forall t \in [2,B],\ u \in S \setminus \cup_i \{\rho_i\},\ v \in S$    (5.19)

$\sum_{t' \le t} w_{u,t'} \ge \sum_{t' \le t} z_{u,t'}$    $\forall t \in [1,B],\ u \in S$    (5.20)

$\sum_{u \in S} z_{u,t} \le 1$    $\forall t \in [1,B]$    (5.21)

$w_{\rho_i,1} = 1$    $\forall i \in [1,n]$    (5.22)

Again, a similar analysis to the tree case shows that this is a valid relaxation, and hence the LP value is at least 
the optimal expected reward. 

5.2 Convex Decomposition: The Altered Phase I 

This is the step which changes the most — we need to incorporate the notion of peeling out a "strategy DAG" 
instead of just a tree. The main complication arises from the fact that a play of a state u may occur at different 
times in the LP solution, depending on the path to reach state u in the transition DAG. However, we don't need 
to keep track of the entire history used to reach u, just how much time has elapsed so far. With this in mind, we 
create B copies of each state u (which will be our nodes in the strategy DAG), indexed by (u, t) for $1 \le t \le B$. 

The $j$-th strategy dag D(i, j) for arm $i$ is an assignment of values prob(i, j, u, t) and a relation '$\to$' from 4-tuples to 4-tuples of the form $(i, j, u, t) \to (i, j, v, t')$ such that the following properties hold:

(i) For $u, v \in S_i$ such that $p_{u,v} > 0$ and any time $t$, there is exactly one time $t' \ge t + 1$ such that $(i, j, u, t) \to (i, j, v, t')$. Intuitively, this says that if the arm is played from state $u$ at time $t$ and it transitions to state $v$, then it is played from $v$ at a unique time $t'$, if it is played at all. If $t' = \infty$, the play from $v$ never happens.

(ii) For any $u \in S_i$ and time $t \ne \infty$, $\mathrm{prob}(i, j, u, t) = \sum_{(v, t') \text{ s.t. } (i, j, v, t') \to (i, j, u, t)} \mathrm{prob}(i, j, v, t') \cdot p_{v,u}$.

For clarity, we use the following notation throughout the remainder of the section: states refer to the states in the original transition DAG, and nodes correspond to the tuples (i, j, u, t) in the strategy DAGs. When i and j are clear in context, we may simply refer to a node of the strategy DAG by (u, t). 



Equipped with the above definition, our convex decomposition procedure appears in Algorithm 5.2. The main subroutine involved is presented first (Algorithm 5.1). This subroutine, given a fractional solution, identifies the structure of the DAG that will be peeled out, depending on when the different states are first played fractionally in the LP solution. Since we have a layered DAG, the notion of the depth of a state is well-defined as the number of hops from the root to this state in the DAG, with the depth of the root being 0. 



Algorithm 5.1 Sub-Routine PeelStrat(i, j)

1: mark $(\rho_i, t)$ where $t$ is the earliest time s.t. $z^{j-1}_{\rho_i,t} > 0$ and set peelProb$(\rho_i, t) = 1$. All other nodes are un-marked and have peelProb$(v, t') = 0$.
2: while $\exists$ a marked unvisited node do
3:    let $(u, t)$ denote the marked node of smallest depth and earliest time; update its status to visited.
4:    for every $v$ s.t. $p_{u,v} > 0$ do
5:       if there is $t'$ such that $z^{j-1}_{v,t'} > 0$, consider the earliest such $t'$ and then
6:          mark $(v, t')$ and set $(i, j, u, t) \to (i, j, v, t')$; update peelProb$(v, t')$ := peelProb$(v, t')$ + peelProb$(u, t) \cdot p_{u,v}$.
7:       else
8:          set $(i, j, u, t) \to (i, j, v, \infty)$ and leave peelProb$(v, \infty) = 0$.

The convex decomposition algorithm is now very easy to describe with the sub-routine in Algorithm 5.1 in hand.



Algorithm 5.2 Convex Decomposition of Arm i

1: set $C_i \leftarrow \emptyset$ and set loop index $j \leftarrow 1$.
2: while $\exists$ a state $u \in S_i$ s.t. $\sum_t z^{j-1}_{u,t} > 0$ do
3:    run sub-routine PeelStrat to extract a DAG D(i, j) with the appropriate peelProb$(u, t)$ values.
4:    let $A \leftarrow \{(u, t) \text{ s.t. } \mathrm{peelProb}(u, t) \ne 0\}$.
5:    let $\epsilon = \min_{(u,t) \in A} z^{j-1}_{u,t} / \mathrm{peelProb}(u, t)$.
6:    for every $(u, t)$ do
7:       set prob$(i, j, u, t) = \epsilon \cdot \mathrm{peelProb}(u, t)$.
8:       update $z^{j}_{u,t} = z^{j-1}_{u,t} - \mathrm{prob}(i, j, u, t)$.
9:       update $w^{j}_{v,t+1} = w^{j-1}_{v,t+1} - \mathrm{prob}(i, j, u, t) \cdot p_{u,v}$ for all $v$.
10:   set $C_i \leftarrow C_i \cup D(i, j)$.
11:   increment $j \leftarrow j + 1$.
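
The two procedures above can be sketched in Python for a single arm. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `peel_strat`, `convex_decompose`, `children`, `depth`, and `horizon` are hypothetical, only the z values are tracked (the w updates of Step 9 are omitted), and when looking for the next play of a child the sketch restricts itself to times after t, matching condition (i) of the strategy dag definition.

```python
from collections import defaultdict

INF = float("inf")
EPS = 1e-12

def peel_strat(root, children, depth, z, horizon):
    """One PeelStrat pass. children[u] = {v: p_uv}; z[(u, t)] = residual play mass."""
    def earliest(v, t_min):
        # Earliest time >= t_min at which v still has positive residual mass.
        for t in range(t_min, horizon + 1):
            if z.get((v, t), 0.0) > EPS:
                return t
        return INF

    start = (root, earliest(root, 1))
    if start[1] == INF:
        return {}, {}              # root is never played: nothing to peel
    peel = defaultdict(float)
    peel[start] = 1.0
    arcs = {}                      # (u, t, v) -> t', i.e. the relation "->"
    marked, visited = {start}, set()
    while marked - visited:
        # Smallest depth first, then earliest time (Step 3).
        u, t = min(marked - visited, key=lambda n: (depth[n[0]], n[1]))
        visited.add((u, t))
        for v, p_uv in children.get(u, {}).items():
            t_next = earliest(v, t + 1)   # assumption: only times after t qualify
            arcs[(u, t, v)] = t_next
            if t_next != INF:
                marked.add((v, t_next))
                peel[(v, t_next)] += peel[(u, t)] * p_uv
    return dict(peel), arcs

def convex_decompose(root, children, depth, z, horizon):
    """Peel strategy DAGs until no play mass is left; returns [(prob, arcs), ...]."""
    z = dict(z)
    dags = []
    while any(mass > EPS for mass in z.values()):
        peel, arcs = peel_strat(root, children, depth, z, horizon)
        if not peel:
            break  # leftover mass without a playable root: input was not LP-feasible
        eps = min(z[node] / q for node, q in peel.items() if q > 0)
        prob = {node: eps * q for node, q in peel.items()}
        for node, pr in prob.items():
            z[node] -= pr          # Step 8
        dags.append((prob, arcs))
    return dags
```

On a small diamond-shaped DAG (root r, children a and b with probability 1/2 each, both leading to c) whose plays are split evenly between two start times, this sketch peels two strategy dags whose prob values add up to the original z values, as Lemma 5.2 below asserts.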



An illustration of a particular DAG and a strategy dag D(i, j) peeled off is given in Figure 5.3 (notice that the states w, y and z appear more than once depending on the path taken to reach them). 





(a) DAG for some arm i (b) Strategy dag D(i, j) 

Figure 5.3: Strategy dags and how to visualize them: notice the same state played at different times. 
Now we analyze the solutions $\{z^j, w^j\}$ created by Algorithm 5.2.

Lemma 5.1 Consider an integer $j$ and suppose that $\{z^{j-1}, w^{j-1}\}$ satisfies constraints (4.10)-(4.12) of $\mathsf{LP}_{\mathsf{mabdag}}$. Then after iteration $j$ of Step 4 the following properties hold:

(a) D(i, j) (along with the associated prob(i, j, ., .) values) is a valid strategy dag, i.e., satisfies the conditions (i) and (ii) presented above.

(b) The residual solution $\{z^j, w^j\}$ satisfies constraints (5.19)-(5.21).

(c) For any time $t$ and state $u \in S_i$, $z^{j-1}_{u,t} - z^{j}_{u,t} = \mathrm{prob}(i, j, u, t)$.
Proof. We show the properties stated above one by one. 



Property (a): This follows from the construction of Algorithm 5.1. More precisely, condition (i) is satisfied because in Algorithm 5.1 each $(u, t)$ is visited at most once and that is the only time when a pair $(u, t) \to (v, t')$ (with $t' \ge t + 1$) is added to the relation. For condition (ii), notice that every time a pair $(u, t) \to (v, t')$ is added to the relation we keep the invariant $\mathrm{peelProb}(v, t') = \sum_{(w, \tau) \text{ s.t. } (i, j, w, \tau) \to (i, j, v, t')} \mathrm{peelProb}(w, \tau) \cdot p_{w,v}$; condition (ii) then follows since prob(.) is a scaling of peelProb(.).

Property (b): Constraint (5.19) of $\mathsf{LP}_{\mathsf{mabdag}}$ is clearly satisfied by the new LP solution $\{z^j, w^j\}$ because of the two updates performed in Steps 8 and 9: if we decrease the $z$ value of any state at any time, the $w$ values of all children are appropriately reduced for the subsequent timestep.

Before showing that the solution $\{z^j, w^j\}$ satisfies constraint (5.20), we first argue that after every round of the procedure they remain non-negative. By the choice of $\epsilon$ in Step 5, we have $\mathrm{prob}(i, j, u, t) = \epsilon \cdot \mathrm{peelProb}(u, t) \le \frac{z^{j-1}_{u,t}}{\mathrm{peelProb}(u, t)} \cdot \mathrm{peelProb}(u, t) = z^{j-1}_{u,t}$ (notice that this inequality holds even if peelProb$(u, t) = 0$); consequently even after the update in Step 8, $z^{j}_{u,t} \ge 0$ for all $u, t$. This and the fact that the constraints (5.19) are satisfied imply that $\{z^j, w^j\}$ satisfies the non-negativity requirement.

We now show that constraint (5.20) is satisfied. Suppose for the sake of contradiction there exist some $u \in S$ and $t \in [1, B]$ such that $\{z^j, w^j\}$ violates this constraint. Then, let us consider any such $u$ and the earliest time $t_u$ such that the constraint is violated. For such a $u$, let $t'_u \le t_u$ be the latest time before $t_u$ where $z^{j-1}_{u,t'_u} > 0$. We now consider two cases.

Case (i): $t'_u < t_u$. This is the simpler case of the two. Because $t_u$ was the earliest time where constraint (5.20) was violated, we know that $\sum_{t' \le t'_u} w^{j}_{u,t'} \ge \sum_{t' \le t'_u} z^{j}_{u,t'}$. Furthermore, since $z_{u,t}$ is never increased during the course of the algorithm we know that $\sum_{t' = t'_u + 1}^{t_u} z^{j}_{u,t'} = 0$. This fact coupled with the non-negativity of $w^{j}_{u,t}$ implies that the constraint in fact is not violated, which contradicts our assumption about the tuple $u, t_u$.

Case (ii): $t'_u = t_u$. In this case, observe that there cannot be any pair of tuples $(v, t_1) \to (u, t_2)$ s.t. $t_1 < t_u$ and $t_2 > t_u$, because any copy of $v$ (some ancestor of $u$) that is played before $t_u$ will mark a copy of $u$ that occurs before $t_u$ or the one being played at $t_u$ in Step 6 of PeelStrat. We will now show that summed over all $t' \le t_u$, the decrease in the LHS is counter-balanced by a corresponding drop in the RHS, between the solutions $\{z^{j-1}, w^{j-1}\}$ and $\{z^j, w^j\}$ for this constraint (5.20) corresponding to $u$ and $t_u$. To this end, notice that the only times when $w_{u,t'}$ is updated (in Step 9) for $t' \le t_u$ are when considering some $(v, t_1)$ in Step 6 such that $(v, t_1) \to (u, t_2)$ and $t_1 < t_2 \le t_u$. The value of $w_{u,t_1+1}$ is dropped by exactly $\mathrm{prob}(i, j, v, t_1) \cdot p_{v,u}$. But notice that the corresponding term $z_{u,t_2}$ drops by $\mathrm{prob}(i, j, u, t_2) = \sum_{(v'', t'') \text{ s.t. } (v'', t'') \to (u, t_2)} \mathrm{prob}(i, j, v'', t'') \cdot p_{v'',u}$. Therefore, the total drop in $w$ is balanced by a commensurate drop in $z$ on the RHS.

Finally, constraint (5.21) is also satisfied as the $z$ variables only decrease in value.

Property (c): This is an immediate consequence of Step 8 of the convex decomposition algorithm.



As a consequence of the above lemma, we get the following. 



Lemma 5.2 Given a solution to ($\mathsf{LP}_{\mathsf{mabdag}}$), there exists a collection of at most $nB^2|S|$ strategy dags {D(i, j)} such that $z_{u,t} = \sum_j \mathrm{prob}(i, j, u, t)$. Hence, $\sum_{i,j,u} \mathrm{prob}(i, j, u, t) \le 1$ for all $t$.

5.3 Phases II and III 



We now show how to execute the strategy dags D(i, j). At a high level, the development of the plays mirrors that of Sections 4.2.2 and 4.2.3. First we transform D(i, j) into a (possibly exponentially large) blown-up tree and show how playing these trees exactly captures playing the strategy dags. Hence (if running time is not a concern), we can simply perform the gap-filling algorithm and make plays on these blown-up trees following Phases II and III in Sections 4.2.2 and 4.2.3. To achieve polynomial running time, we then show that we can implicitly execute the gap-filling phase while playing this tree, thus avoiding an explicit execution of the phase in Section 4.2.2. Finally, to complete our argument, we show that we do not need to explicitly construct the blown-up tree, and can generate the required portions on demand, depending on the transitions made thus far.

5.3.1 Transforming the DAG into a Tree 

Consider any strategy dag D(i, j). We first transform this dag into a (possibly exponential) tree by making as many copies of a node $(i, j, u, t)$ as there are paths from the root to $(i, j, u, t)$ in D(i, j). More formally, define DT(i, j) as the tree whose vertices are the simple paths in D(i, j) which start at the root. To avoid confusion, we will explicitly refer to vertices of the tree DT as tree-nodes, as distinguished from the nodes in D; to simplify the notation we identify each tree-node in DT with its corresponding path in D. Given two tree-nodes $P, P'$ in DT(i, j), add an arc from $P$ to $P'$ if $P'$ is an immediate extension of $P$, i.e., if $P$ corresponds to some path $(i, j, u_1, t_1) \to \dots \to (i, j, u_k, t_k)$ in D(i, j), then $P'$ is a path $(i, j, u_1, t_1) \to \dots \to (i, j, u_k, t_k) \to (i, j, u_{k+1}, t_{k+1})$ for some node $(i, j, u_{k+1}, t_{k+1})$.

For a tree-node $P \in$ DT(i, j) which corresponds to the path $(i, j, u_1, t_1) \to \dots \to (i, j, u_k, t_k)$ in D(i, j), we define state$(P) = u_k$, i.e., state(.) denotes the final state (in $S_i$) in the path $P$. Now, for tree-node $P \in$ DT(i, j), if $u_1, \dots, u_k$ are the children of state$(P)$ in $S_i$ with positive transition probability from state$(P)$, then $P$ has exactly $k$ children $P_1, \dots, P_k$ with state$(P_l)$ equal to $u_l$ for all $l \in [k]$. The depth of a tree-node $P$ is defined as the depth of state$(P)$.

We now define the quantities time and prob for tree-nodes in DT(i, j). Let $P$ be a path in D(i, j) from $\rho_i$ to node $(i, j, u, t)$. We define time$(P) := t$ and prob$(P) := \mathrm{prob}(P') \cdot p_{\mathrm{state}(P'), u}$, where $P'$ is obtained by dropping the last node from $P$. The blown-up tree DT(i, j) of our running example D(i, j) (Figure 5.3) is given in Figure 5.4.



Lemma 5.3 For any state $u$ and time $t$, $\sum_{P \text{ s.t. } \mathrm{time}(P) = t \text{ and } \mathrm{state}(P) = u} \mathrm{prob}(P) = \mathrm{prob}(i, j, u, t)$.
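
The blow-up and the identity in Lemma 5.3 can be checked on small examples with a short Python sketch. The function names `blow_up` and `mass_by_node` and the input encoding (a dictionary `arcs` from a node `(state, time)` to its children together with the transition probabilities) are assumptions made only for this illustration.

```python
from collections import defaultdict

def blow_up(root, root_prob, arcs):
    """Enumerate the tree-nodes of the blown-up tree as (path, prob) pairs.

    arcs[node] is a list of (child_node, transition_prob); nodes are (state, time).
    A tree-node is identified with its root-to-node path, as in the text.
    """
    tree = []
    stack = [((root,), root_prob)]
    while stack:
        path, pr = stack.pop()
        tree.append((path, pr))
        for child, p in arcs.get(path[-1], []):
            # prob(P) = prob(P') * p_{state(P'), u}
            stack.append((path + (child,), pr * p))
    return tree

def mass_by_node(tree):
    """Sum prob(P) over all tree-nodes P that end at the same (state, time)."""
    mass = defaultdict(float)
    for path, pr in tree:
        mass[path[-1]] += pr
    return dict(mass)
```

For a strategy dag in which the node (c, 3) is reachable through both (a, 2) and (b, 2), the tree contains two copies of (c, 3), and their prob values add up to the prob of the dag node.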




Figure 5.4: Blown-up Strategy Forest DT(i, j) 
Now that we have a tree labeled with prob and time values, the notions of connected components and heads from Section 4.2.2 carry over. Specifically, we define Head$(P)$ to be the ancestor $P'$ of $P$ in DT(i, j) with least depth such that there is a path $(P' = P_1 \to \dots \to P_l = P)$ satisfying time$(P_i)$ = time$(P_{i-1}) + 1$ for all $i \in [2, l]$, i.e., the plays are made contiguously from Head$(P)$ to $P$ in the blown-up tree. We also define comp$(P)$ as the set of all tree-nodes $P'$ such that Head$(P)$ = Head$(P')$.

In order to play the strategies DT(i, j) we first eliminate small gaps. The algorithm GapFill presented in Section 4.2.2 can be employed for this purpose and returns trees DT(i, j) which satisfy the analog of Lemma 4.4.

Lemma 5.4 The trees returned by GapFill satisfy the following properties.

(i) For each tree-node $P$ such that $r_{\mathrm{state}(P)} > 0$, time(Head$(P)$) $\ge 2 \cdot$ depth(Head$(P)$).

(ii) The total extent of plays at any time $t$, i.e., $\sum_{P : \mathrm{time}(P) = t} \mathrm{prob}(P)$, is at most 3.

Now we use Algorithm 4.2 to play the trees DT(i, j). We restate the algorithm to conform with the notation used in the trees DT(i, j).

Now an argument identical to that for the corresponding theorem in the tree case gives us the following:

Theorem 5.5 The reward obtained by the algorithm AlgDAG is at least a constant fraction of the optimum for ($\mathsf{LP}_{\mathsf{mabdag}}$).



5.3.2 Implicit gap filling 

Our next goal is to execute GapFill implicitly, that is, to incorporate the gap-filling within Algorithm AlgDAG 
without having to explicitly perform the advances. 

To do this, let us review some properties of the trees returned by GapFill. For a tree-node $P$ in DT(i, j), let time$(P)$ denote the associated time in the original tree (i.e., before the application of GapFill) and let time'$(P)$ denote the time in the modified tree (i.e., after DT(i, j) is modified by GapFill).

Algorithm 5.3 Scheduling the Connected Components: Algorithm AlgDAG

1: for arm $i$, sample strategy DT(i, j) with probability prob(root(DT(i, j)))/24; ignore arm $i$ w.p. $1 - \sum_j$ prob(root(DT(i, j)))/24.
2: let $A \leftarrow$ set of "active" arms which chose a strategy in the random process.
3: for each $i \in A$, let $\sigma(i) \leftarrow$ index $j$ of the chosen DT(i, j) and let currnode$(i) \leftarrow$ root of DT$(i, \sigma(i))$.
4: while active arms $A \ne \emptyset$ do
5:    let $i^* \leftarrow$ arm with tree-node played earliest (i.e., $i^* \leftarrow \mathrm{argmin}_{i \in A}$ {time(currnode$(i)$)}).
6:    let $\tau \leftarrow$ time(currnode$(i^*)$).
7:    while time(currnode$(i^*)$) $\ne \infty$ and time(currnode$(i^*)$) $= \tau$ do
8:       play arm $i^*$ at state state(currnode$(i^*)$)
9:       let $u$ be the new state of arm $i^*$ and let $P$ be the child of currnode$(i^*)$ satisfying state$(P) = u$.
10:      update currnode$(i^*)$ to be $P$; let $\tau \leftarrow \tau + 1$.
11:   if time(currnode$(i^*)$) $= \infty$ then
12:      let $A \leftarrow A \setminus \{i^*\}$



Claim 5.6 For a non-root tree-node $P$ and its parent $P'$, time'$(P)$ = time'$(P') + 1$ if and only if either time$(P)$ = time$(P') + 1$ or $2 \cdot$ depth$(P) >$ time$(P)$.

Proof. Let us consider the forward direction. Suppose time'$(P)$ = time'$(P') + 1$ but time$(P) >$ time$(P') + 1$. Then $P$ must have been the head of its component in the original tree and an advance was performed on it, so we must have $2 \cdot$ depth$(P) >$ time$(P)$.

For the reverse direction, if time$(P)$ = time$(P') + 1$ then $P$ could not have been a head since it belongs to the same component as $P'$ and hence it will always remain in the same component as $P'$ (as GapFill only merges components and never breaks them apart). Therefore, time'$(P)$ = time'$(P') + 1$. On the other hand, if time$(P) >$ time$(P') + 1$ and $2 \cdot$ depth$(P) >$ time$(P)$, then $P$ was a head in the original tree, and because of the above criterion, GapFill must have made an advance on $P'$ thereby including it in the same component as $P$; so again it is easy to see that time'$(P)$ = time'$(P') + 1$.

The crucial point here is that whether or not P is in the same component as its predecessor after the gap-filling 
(and, consequently, whether it was played contiguously along with its predecessor should that transition happen 
in AlgDAG) can be inferred from the time values of P, P' before gap-filling and from the depth of P — it does 
not depend on any other advances that happen during the gap-filling. 
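
The local test of Claim 5.6 is simple enough to state as a one-line predicate. The Python sketch below is only an illustration; the function name and argument names are hypothetical.

```python
def played_contiguously(time_p, time_parent, depth_p):
    """Claim 5.6: after gap-filling, P is played right after its parent iff
    it was already contiguous, or the advance criterion 2*depth(P) > time(P) holds."""
    return time_p == time_parent + 1 or 2 * depth_p > time_p
```

For instance, a tree-node at depth 3 with original time 5 whose parent has time 2 is pulled forward (since 6 > 5), whereas the same node with original time 7 is not.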

Algorithm 5.4 is a procedure which plays the original trees DT(i, j) while implicitly performing the advance steps of GapFill (by checking if the properties of Claim 5.6 hold). This change is reflected in Step 7, where we may play a node even if it is not contiguous, so long as it satisfies the above stated properties. Therefore, as a consequence of Claim 5.6, we get the following lemma, which says that the plays made by ImplicitFill are identical to those made by AlgDAG after running GapFill.

Lemma 5.7 Algorithm ImplicitFill obtains the same reward as algorithm AlgDAG $\circ$ GapFill.

5.3.3 Running ImplicitFill in Polynomial Time

With the description of ImplicitFill, our proof is almost complete, with the exception of handling the exponential blow-up incurred in moving from D to DT. To resolve this, we now argue that while the blown-up DT made it easy to visualize the transitions and plays made, all of it can be done implicitly from the strategy DAG D. Recall that the tree-nodes in DT(i, j) correspond to simple paths in D(i, j). In the following, the final algorithm we employ (called ImplicitPlay) is simply the algorithm ImplicitFill, but with the exponentially blown-up trees DT$(i, \sigma(i))$ being generated on-demand, as the different transitions are made. We now describe how this can be done.






Algorithm 5.4 Filling gaps implicitly: Algorithm ImplicitFill

1: for arm $i$, sample strategy DT(i, j) with probability prob(root(DT(i, j)))/24; ignore arm $i$ w.p. $1 - \sum_j$ prob(root(DT(i, j)))/24.
2: let $A \leftarrow$ set of "active" arms which chose a strategy in the random process.
3: for each $i \in A$, let $\sigma(i) \leftarrow$ index $j$ of the chosen DT(i, j) and let currnode$(i) \leftarrow$ root of DT$(i, \sigma(i))$.
4: while active arms $A \ne \emptyset$ do
5:    let $i^* \leftarrow$ arm with state played earliest (i.e., $i^* \leftarrow \mathrm{argmin}_{i \in A}$ {time(currnode$(i)$)}).
6:    let $\tau \leftarrow$ time(currnode$(i^*)$).
7:    while time(currnode$(i^*)$) $\ne \infty$ and (time(currnode$(i^*)$) $= \tau$ or $2 \cdot$ depth(currnode$(i^*)$) $>$ time(currnode$(i^*)$)) do
8:       play arm $i^*$ at state state(currnode$(i^*)$)
9:       let $u$ be the new state of arm $i^*$ and let $P$ be the child of currnode$(i^*)$ satisfying state$(P) = u$.
10:      update currnode$(i^*)$ to be $P$; let $\tau \leftarrow \tau + 1$.
11:   if time(currnode$(i^*)$) $= \infty$ then
12:      let $A \leftarrow A \setminus \{i^*\}$



In Step 3 of ImplicitFill, we start off at the roots of the trees DT$(i, \sigma(i))$, which correspond to the single-node paths consisting of the roots of D$(i, \sigma(i))$. Now, at some point in time in the execution of ImplicitFill, suppose we are at the tree-node currnode$(i^*)$, which corresponds to a path $Q$ in D$(i, \sigma(i))$ that ends at $(i, \sigma(i), v, t)$ for some $v$ and $t$. The invariant we maintain is that, in our algorithm ImplicitPlay, we are at node $(i, \sigma(i), v, t)$ in D$(i, \sigma(i))$. Establishing this invariant would show that the two runs ImplicitPlay and ImplicitFill would be identical, which when coupled with Theorem 5.5 would complete the proof: the information that ImplicitFill uses of $Q$, namely time$(Q)$ and depth$(Q)$, can be obtained from $(i, \sigma(i), v, t)$.

The invariant is clearly satisfied at the beginning, for the different root nodes. Suppose it is true for some tree-node currnode$(i)$, which corresponds to a path $Q$ in D$(i, \sigma(i))$ that ends at $(i, \sigma(i), v, t)$ for some $v$ and $t$. Now, suppose upon playing the arm $i$ at state $v$ (in Step 8), we make a transition to state $u$ (say), then ImplicitFill would find the unique child tree-node $P$ of $Q$ in DT$(i, \sigma(i))$ with state$(P) = u$. Then let $(i, \sigma(i), u, t')$ be the last node of the path $P$, so that $P$ equals $Q$ followed by $(i, \sigma(i), u, t')$.

But, since the tree DT$(i, \sigma(i))$ is just an expansion of D$(i, \sigma(i))$, the unique child $P$ in DT$(i, \sigma(i))$ of tree-node $Q$ which has state$(P) = u$ is (by definition of DT) the unique node $(i, \sigma(i), u, t')$ of D$(i, \sigma(i))$ such that $(i, \sigma(i), v, t) \to (i, \sigma(i), u, t')$. Hence, just as ImplicitFill transitions to $P$ in DT$(i, \sigma(i))$ (in Steps 9 and 10), we can transition to the state $(i, \sigma(i), u, t')$ with just D at our disposal, thus establishing the invariant.

For completeness, we present the implicit algorithm below. 
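
As an informal companion to the argument above, the following Python sketch plays already-sampled strategy DAGs directly, using the test from Step 7 of ImplicitFill to decide when to keep playing an arm. It is a sketch under stated assumptions rather than the authors' algorithm: the sampling of Step 1 is assumed to have happened already, the explicit `budget` cap and all names (`implicit_play`, `roots`, `arcs`, `depth`, `step`) are hypothetical, and `step` stands in for the random state transition of an arm.

```python
import math

def implicit_play(roots, arcs, depth, step, budget):
    """Play sampled strategy DAGs, inferring gap-fill advances on the fly.

    roots[i] = (state, time) of arm i's root node;
    arcs[i][(u, t)][v] = t' encodes (u, t) -> (v, t'), with math.inf for "never";
    depth[i][u] = depth of state u; step(i, u) returns the next state or None at a leaf.
    """
    cur = dict(roots)              # current DAG node of every active arm
    active = set(cur)
    plays = []
    while active and len(plays) < budget:
        i = min(active, key=lambda a: (cur[a][1], a))   # earliest pending node
        tau = cur[i][1]
        while len(plays) < budget and cur[i][1] != math.inf and (
            cur[i][1] == tau or 2 * depth[i][cur[i][0]] > cur[i][1]
        ):
            u, t = cur[i]
            plays.append((i, u))
            v = step(i, u)
            # Only the current node (u, t) is needed to find the next node.
            cur[i] = (u, math.inf) if v is None else (v, arcs[i][(u, t)][v])
            tau += 1
        if cur[i][1] == math.inf:
            active.discard(i)
    return plays
```

The point of the sketch is that the loop never materializes a path: the pair (state, time) of the current node, together with the depth of the state, is all the information the Step 7 test consumes.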



6 Concluding Remarks 

We presented the first constant-factor approximations for the stochastic knapsack problem with cancellations and 
correlated size/reward pairs, and for the budgeted learning problem without the martingale property. We showed 
that existing LPs for the restricted versions of the problems have large integrality gaps, which required us to give 
new LP relaxations, as well as new rounding algorithms for these problems. 

Acknowledgments. We thank Kamesh Munagala and Sudipto Guha for useful conversations. 
A Some Bad Examples

A.1 Badness Due to Cancelations

We first observe that the LP relaxation for the StocK problem used in [DGV08] has a large integrality gap in the model where cancelations are allowed, even when the rewards are fixed for any item. This was also noted in [Dea05]. Consider the following example: there are n items, every item instantiates to a size of 1 with probability 0.5 or a size of n/2 with probability 0.5, and its reward is always 1. Let the total size of the knapsack be B = n. For such an instance, a good solution would cancel any item that does not terminate at size 1; this way, it can collect a reward of at least n/2 in expectation, because an average of n/2 items will instantiate with a size 1 and these will all contribute to the reward. On the other hand, the LP from [DGV08] has value O(1), since the mean size of any item is at least n/4. In fact, any strategy that does not cancel jobs will also accrue only O(1) reward.
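
The gap in this example can be verified numerically. The Python sketch below is illustrative only: the function names are hypothetical, and the non-canceling policy is modeled under the assumption that an item's reward is collected only if the item fits entirely in the remaining budget.

```python
from functools import lru_cache

def cancel_value(n):
    # Run every item for one unit of time; keep it if it finished, cancel otherwise.
    # Each attempt costs one unit of the budget B = n and succeeds with probability 1/2.
    return n / 2

def no_cancel_value(n):
    """Exact expected reward when every started item must run to completion."""
    big = n // 2

    @lru_cache(maxsize=None)
    def value(budget, items_left):
        if budget <= 0 or items_left == 0:
            return 0.0
        # Size 1: always fits while budget > 0.
        small = 1 + value(budget - 1, items_left - 1)
        # Size n/2: rewarded only if it still fits in the remaining budget.
        large = (1 if budget >= big else 0) + value(budget - big, items_left - 1)
        return 0.5 * small + 0.5 * large

    return value(n, n)
```

For n = 40 the canceling policy earns 20 in expectation, while the non-canceling value computed this way stays between 3 and 4, in line with the O(1) claim.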

A.2 Badness Due to Correlated Rewards 

While the LP relaxations used for MAB (e.g., the formulation in [GM07a]) can handle the issue explained above w.r.t. cancelations, we now present an example of stochastic knapsack (where the reward is correlated with the actual size) for which the existing MAB LP formulations all have a large integrality gap.

Consider the following example: there are n items, every item instantiates to a size of 1 with probability 1 - 1/n or a size of n with probability 1/n, and its reward is 1 only if its size is n, and 0 otherwise. Let the total size of the knapsack be B = n. Clearly, any integral solution can fetch an expected reward of 1/n: if the first item it schedules instantiates to a large size, then it gives us a reward. Otherwise, no subsequent item can be fit within our budget even if it instantiates to its large size. The issue with the existing LPs is that the arm-pull constraints are ensured locally, and there is one global budget. That is, even if we play each arm to completion individually, the expected size (i.e., number of pulls) they occupy is $1 \cdot (1 - 1/n) + n \cdot (1/n) \le 2$. Therefore, such LPs can accommodate n/2 jobs, fetching a total reward of $\Omega(1)$. This example brings to attention the fact that all these items are competing to be pulled in the first time slot (if we begin an item in any later time slot it fetches zero 



reward), thus naturally motivating our time-indexed LP formulation in Section p.2 . 

In fact, the above example also shows that if we allow ourselves a budget of 2B, i.e., 2n in this case, we can in fact achieve an expected reward of $\Omega(1)$ (much higher than what is possible with a budget of B): keep playing all items one by one, until one of them does not stop after size 1, and then play that to completion; this event happens with probability $\Omega(1)$.
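
The two quantities compared in this example are easy to evaluate. The following Python sketch (with hypothetical function names) computes the best integral value with budget n and the value of the probing strategy with budget 2n.

```python
def integral_value(n):
    # Only the first scheduled item can pay off: it must take its large size n.
    return 1 / n

def doubled_budget_value(n):
    # With budget 2n: probe items for one unit each; the first item that does
    # not stop at size 1 is run to completion (at most n probes plus n units fit).
    return 1 - (1 - 1 / n) ** n
```

For n = 100 the first value is 0.01, while the second is roughly 1 - 1/e, i.e. a constant.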
A.3 Badness Due to the Non-Martingale Property in MAB: The Benefit of Preemption

Not only do cancelations help in our problems (as can be seen from the example in Appendix A.1), we now show that even preemption is necessary in the case of MAB where the rewards do not satisfy the martingale property. In fact, this brings forward another key difference between our rounding scheme and earlier algorithms for MAB: the necessity of preempting arms is not an artifact of our algorithm/analysis but, rather, is unavoidable.

Consider the following instance. There are n identical arms, each of them with the following (recursively defined) transition tree starting at $\rho(0)$:

When the root $\rho(j)$ is pulled for $j < m$, the following two transitions can happen:

(i) with probability $1/(n \cdot n^{m-j})$, the arm transitions to the "right-side", where if it makes $B - n(\sum_{k=0}^{j} L^k)$ plays, it will deterministically reach a state with reward $n^{m-j}$. All intermediate states have 0 reward.

(ii) with probability $1 - 1/(n \cdot n^{m-j})$, the arm transitions to the "left-side", where if it makes $L^{j+1} - 1$ plays, it will deterministically reach the state $\rho(j+1)$. No state along this path fetches any reward.

Finally, node $\rho(m)$ makes the following transitions when played: (i) with probability 1/n, to a leaf state that has a reward of 1 and the arm ends there; (ii) with probability 1 - 1/n, to a leaf state with reward of 0.

For the following calculations, assume that $B \gg L > n$ and $m \gg 0$.

Preempting Solutions. We first exhibit a preempting solution with expected reward $\Omega(m)$. The strategy plays $\rho(0)$ of all the arms until one of them transitions to the "right-side", in which case it continues to play this until it fetches a reward of $n^m$. Notice that any root which transitioned to the right-side can be played to completion, because the number of pulls we have used thus far is at most n (only those at the $\rho(0)$ nodes for each arm), and the size of the right-side is exactly B - n. Now, if all the arms transitioned to the left-side, then it plays the $\rho(1)$ of each arm until one of them transitions to the right-side, in which case it continues playing this arm and gets a reward of $n^{m-1}$. Again, any root $\rho(1)$ which transitioned to the right-side can be played to completion, because the number of pulls we have used thus far is at most n(1 + L) (for each arm, we have pulled the root $\rho(0)$, traversed the walk of length L - 1 to $\rho(1)$ and then pulled $\rho(1)$), and the size of the right-side is exactly B - n(1 + L). This strategy is similarly defined, recursively.

We now calculate the expected reward: if any of the roots $\rho(0)$ made a transition to the right-side, we get a reward of $n^m$. This happens with probability roughly $1/n^m$, giving us an expected reward of 1 in this case. If all the roots made the transition to the left-side, then at least one of the $\rho(1)$ states will make a transition to their right-side with probability $1/n^{m-1}$, in which case we will get a reward of $n^{m-1}$, and so on. Thus, summing over the first m/2 such rounds, our expected reward is at least

\[
\frac{1}{n^m} n^m + \left(1 - \frac{1}{n^m}\right) \frac{1}{n^{m-1}} n^{m-1} + \left(1 - \frac{1}{n^m}\right)\left(1 - \frac{1}{n^{m-1}}\right) \frac{1}{n^{m-2}} n^{m-2} + \cdots
\]

Each term above is $\Omega(1)$, giving us a total of $\Omega(m)$ expected reward.
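
The sum above can be evaluated exactly for small parameters. The Python sketch below uses the same approximation as the text (the chance that some arm goes right in round k is taken to be $1/n^{m-k}$); the function name is hypothetical.

```python
from fractions import Fraction

def preempting_lower_bound(n, m):
    """Sum of the first m // 2 terms of the expected-reward series."""
    total = Fraction(0)
    all_left = Fraction(1)   # probability that every earlier round went left
    for k in range(m // 2):
        hit = Fraction(1, n ** (m - k))      # approx. chance some arm goes right
        total += all_left * hit * n ** (m - k)
        all_left *= 1 - hit
    return total
```

With n = 3 and m = 10 the five terms sum to just under 5, illustrating that each term is close to 1 and the total grows linearly in m.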

Non-Preempting Solutions. Consider any non-preempting solution. Once it has played the first node of an arm and it has transitioned to the left-side, it has to irrevocably decide if it abandons this arm or continues playing. But if it has continued to play (and made the transition of L - 1 steps), then it cannot get any reward from the right-side of $\rho(0)$ of any of the other arms, because L > n and the right-side requires B - n pulls before reaching a reward-state. Likewise, if it has decided to move from $\rho(i)$ to $\rho(i+1)$ on any arm, it cannot get any reward from the right-sides of $\rho(0), \rho(1), \dots, \rho(i)$ on any arm due to budget constraints. Indeed, for any $i \ge 1$, to have reached $\rho(i+1)$ on any particular arm, it must have utilized $(1 + L - 1) + (1 + L^2 - 1) + \dots + (1 + L^{i+1} - 1)$ pulls in total, which exceeds $n(1 + L + L^2 + \dots + L^i)$ since L > n. Finally, notice that if the strategy has decided to move from $\rho(i)$ to $\rho(i+1)$ on any arm, the maximum reward that it can obtain is $n^{m-i-1}$, namely, the reward from the right-side transition of $\rho(i+1)$.

Using these properties, we observe that an optimal non-preempting strategy proceeds in rounds as described next.

Strategy at round i. Choose a set $N_i$ of available arms and play them as follows: pick one of these arms, play until reaching state $\rho(i)$ and then play once more. If there is a right-side transition before reaching state $\rho(i)$, discard this arm since there is not enough budget to play until reaching a state with positive reward. If there is a right-side transition at state $\rho(i)$, play this arm until it gives reward of $n^{m-i}$. If there is no right-side transition and there is another arm in $N_i$ which is still to be played, discard the current arm and pick the next arm in $N_i$.

In round i, at least $\max(0, n_i - 1)$ arms are discarded (where $n_i = |N_i|$), hence $\sum_i n_i \le 2n$. Therefore, the expected reward can be at most

\[
\sum_i \frac{n_i}{n \cdot n^{m-i}} \cdot n^{m-i} = \frac{1}{n} \sum_i n_i \le 2.
\]



B Proofs from Section 2

B.1 Proof of Theorem 2.3

Let $\mathsf{add}_i$ denote the event that item $i$ was added to the knapsack in Step 5. Also, let $V_i$ denote the random variable corresponding to the reward that our algorithm gets from item $i$.

Clearly if item $i$ has $D_i = t$ and was added, then it is added to the knapsack before time $t$. In this case it is easy to see that $E[V_i \mid \mathsf{add}_i \wedge (D_i = t)] \ge ER_{i,t}$ (because its random size is independent of when the algorithm started it). Moreover, from the previous lemma we have that $\Pr(\mathsf{add}_i \mid (D_i = t)) \ge 1/2$ and from Step 1 we have $\Pr(D_i = t) = x^*_{i,t}/4$; hence $\Pr(\mathsf{add}_i \wedge (D_i = t)) \ge x^*_{i,t}/8$. Finally adding over all possibilities of $t$, we lower bound the expected value of $V_i$ by

\[
E[V_i] \ge \sum_t E[V_i \mid \mathsf{add}_i \wedge (D_i = t)] \cdot \Pr(\mathsf{add}_i \wedge (D_i = t)) \ge \frac{1}{8} \sum_t x^*_{i,t} \, ER_{i,t}.
\]

Finally, linearity of expectation over all items shows that the total expected reward of our algorithm is at least $\frac{1}{8} \cdot \sum_{i,t} x^*_{i,t} \, ER_{i,t} = \mathsf{LPOpt}/8$, thus completing the proof.

B.2 Making StocK-NoCancel Fully Polynomial 



Recall that our LP relaxation $\mathsf{LP}_{\mathsf{NoCancel}}$ in Section 2 uses a global time-indexed LP. In order to make it compact, our approach will be to group the B timeslots in $\mathsf{LP}_{\mathsf{NoCancel}}$ and show that the grouped LP has optimal value within a constant factor of $\mathsf{LP}_{\mathsf{NoCancel}}$; furthermore, we show also that it can be rounded and analyzed almost identically to the original LP. To this end, consider the following LP relaxation:

\begin{align}
\max \ & \sum_i \sum_{j=0}^{\log B} ER_{i,2^{j+1}} \cdot x_{i,2^j} \tag{$\mathsf{PolyLP}_L$} \\
& \sum_{j=0}^{\log B} x_{i,2^j} \le 1 && \forall i \tag{B.23} \\
& \sum_{i,\, j' \le j} x_{i,2^{j'}} \cdot E[\min(S_i, 2^{j+1})] \le 2 \cdot 2^j && \forall j \in [0, \log B] \tag{B.24} \\
& x_{i,2^j} \in [0, 1] && \forall j \in [0, \log B],\ \forall i \tag{B.25}
\end{align}

The next two lemmas relate the value of ($\mathsf{PolyLP}_L$) to that of the original LP ($\mathsf{LP}_{\mathsf{NoCancel}}$).

Lemma B.1 The optimum of ($\mathsf{PolyLP}_L$) is at least half of the optimum of ($\mathsf{LP}_{\mathsf{NoCancel}}$).



Proof. Consider a solution $x$ for ($\mathsf{LP}_{\mathsf{NoCancel}}$) and define $\hat{x}_{i,1} = x_{i,1}/2 + \sum_{t \in [2,4)} x_{i,t}/2$ and $\hat{x}_{i,2^j} = \sum_{t \in [2^{j+1}, 2^{j+2})} x_{i,t}/2$ for $1 \le j \le \log B$. It suffices to show that $\hat{x}$ is a feasible solution to ($\mathsf{PolyLP}_L$) with value greater than or equal to half of the value of $x$.

For constraints (B.23) we have $\sum_{j=0}^{\log B} \hat{x}_{i,2^j} = \sum_{t \ge 1} x_{i,t}/2 \le 1/2$; these constraints are therefore easily satisfied. We now show that $\{\hat{x}\}$ also satisfies constraints (B.24):

\[
\sum_{i,\, j' \le j} \hat{x}_{i,2^{j'}} \, E[\min(S_i, 2^{j+1})]
= \sum_i \sum_{t=1}^{2^{j+2}-1} \frac{x_{i,t}}{2} \, E[\min(S_i, 2^{j+1})]
\le \frac{1}{2} \sum_i \sum_{t=1}^{2^{j+2}-1} x_{i,t} \, E[\min(S_i, 2^{j+2}-1)]
\le 2^{j+2},
\]

where the last inequality follows from feasibility of $\{x\}$.

Finally, noticing that $ER_{i,t}$ is non-increasing with respect to $t$, it is easy to see that $\sum_i \sum_{j=0}^{\log B} ER_{i,2^{j+1}} \cdot \hat{x}_{i,2^j} \ge \sum_{i,t} ER_{i,t} \cdot x_{i,t}/2$ and hence $\hat{x}$ has value greater than or equal to half of the value of $x$, as desired.



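Lemma B.2 Let $\{\hat{x}\}$ be a feasible solution for ($\mathsf{PolyLP}_L$). Define $\{\bar{x}\}$ satisfying $\bar{x}_{i,t} = \hat{x}_{i,2^j}/2^j$ for all $t \in [2^j, 2^{j+1})$ and $i \in [n]$. Then $\{\bar{x}\}$ is feasible for ($\mathsf{LP}_{\mathsf{NoCancel}}$) and has value at least as large as $\{\hat{x}\}$.

Proof. The feasibility of $\{\hat{x}\}$ directly implies that $\{\bar{x}\}$ satisfies constraints (2.1). For constraints (2.2), consider $t \in [2^j, 2^{j+1})$; then we have the following:

\[
\sum_{i,\, t' \le t} \bar{x}_{i,t'} \cdot E[\min(S_i, t)]
\le \sum_i \sum_{j' \le j} \sum_{t' \in [2^{j'}, 2^{j'+1})} \frac{\hat{x}_{i,2^{j'}}}{2^{j'}} \, E[\min(S_i, 2^{j+1})]
= \sum_i \sum_{j' \le j} \hat{x}_{i,2^{j'}} \, E[\min(S_i, 2^{j+1})]
\le 2 \cdot 2^j \le 2t.
\]

Finally, again using the fact that $ER_{i,t}$ is non-increasing in $t$ we get that the value of $\{\bar{x}\}$ is

\[
\sum_i \sum_{j=0}^{\log B} \sum_{t \in [2^j, 2^{j+1})} ER_{i,t} \, \bar{x}_{i,t}
\ge \sum_i \sum_{j=0}^{\log B} \sum_{t \in [2^j, 2^{j+1})} ER_{i,2^{j+1}} \cdot \frac{\hat{x}_{i,2^j}}{2^j}
= \sum_i \sum_{j=0}^{\log B} ER_{i,2^{j+1}} \, \hat{x}_{i,2^j},
\]

which is then at least as large as the value of $\{\hat{x}\}$. This concludes the proof of the lemma.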



The above two lemmas show that $(\mathsf{PolyLP}_L)$ has value close to that of $(\mathsf{LP}_{\mathsf{NoCancel}})$; let us now show that we can simulate the execution of Algorithm StocK-Large given just an optimal solution $\{\bar{x}\}$ for $(\mathsf{PolyLP}_L)$. Let $\{\hat{x}\}$ be defined as in the above lemma, and consider Algorithm StocK-Large applied to $\{\hat{x}\}$. By the definition of $\{\hat{x}\}$, here is how to execute Step 1 (and hence the whole algorithm) in polynomial time: we obtain $D_i = t$ by picking $j \in [0, \log B]$ with probability $\bar{x}_{i,2^j}$ and then selecting $t \in [2^j, 2^{j+1})$ uniformly; notice that indeed $D_i = t$ (with $t \in [2^j, 2^{j+1})$) with probability $\bar{x}_{i,2^j}/2^j = \hat{x}_{i,t}$.
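The two-stage sampling of $D_i$ described above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `deadline_distribution` and `sample_deadline`, and the list `xbar` with `xbar[j]` standing for $\bar{x}_{i,2^j}$ of a single item, are hypothetical, and returning `None` with the leftover probability $1 - \sum_j \bar{x}_{i,2^j}$ is an assumption made here for completeness. The first function enumerates all $t$ and is meant only for checking small examples; the second is the polynomial-time sampler.

```python
import random
from fractions import Fraction


def deadline_distribution(xbar):
    """Exact law of D_i: every t in [2^j, 2^(j+1)) receives mass xbar[j] / 2^j.

    Enumerates all t, so it is for sanity checks on small inputs only.
    """
    dist = {}
    for j, mass in enumerate(xbar):
        for t in range(2 ** j, 2 ** (j + 1)):
            dist[t] = Fraction(mass) / 2 ** j
    return dist


def sample_deadline(xbar, rng=random):
    """Pick bucket j with probability xbar[j], then t uniformly in the bucket.

    Assumption: with the leftover probability 1 - sum(xbar) no deadline is
    assigned and None is returned.
    """
    r = rng.random()
    acc = 0.0
    for j, mass in enumerate(xbar):
        acc += float(mass)
        if r < acc:
            return rng.randrange(2 ** j, 2 ** (j + 1))
    return None
```

The sampler touches only the $O(\log B)$ bucket masses, which is the point of the construction: the algorithm never needs the $B$ individual values $\hat{x}_{i,t}$.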

Using this observation we can obtain a 1/16 approximation for our instance $\mathcal{I}$ in polynomial time by finding the optimal solution $\{\bar{x}\}$ for $(\mathsf{PolyLP}_L)$ and then running Algorithm StocK-Large over $\{\hat{x}\}$ as described in the previous paragraph. Using a direct modification of Theorem 2.3 we have that the strategy obtained has expected reward at least as large as 1/8 of the value of $\{\hat{x}\}$, which by Lemmas B.1 and B.2 (and Lemma 2.1) is within a factor of 1/16 of the optimal solution for $\mathcal{I}$.

C Proofs from Section 3
C.1 Proof of Lemma



The proof works by induction. For the base case, consider $t = 0$. Clearly, this item is forcefully canceled in step 4 of Algorithm 3.1 StocK-Small (in the iteration with $t = 0$) with probability $s^*_{i,0}/v^*_{i,0} - \pi_{i,0}/\sum_{t' \ge 0} \pi_{i,t'}$. But since $\pi_{i,0}$ was assumed to be 0 and $v^*_{i,0}$ is 1, this quantity is exactly $s^*_{i,0}$, and this proves property (i). For property (ii), item $i$ is processed for its 1st timestep if it did not get forcefully canceled in step 4. This therefore happens with probability $1 - s^*_{i,0} = v^*_{i,0} - s^*_{i,0} = v^*_{i,1}$. For property (iii), conditioned on the fact that it has been processed for its 1st timestep, clearly the probability that its (unknown) size has instantiated to 1 is exactly $\pi_{i,1}/\sum_{t' \ge 1} \pi_{i,t'}$. When this happens, the job stops in step 7, thereby establishing the base case.



Assuming this property holds for every timestep until some fixed value $t - 1$, we show that it holds for $t$; the proofs are very similar to the base case. Assume item $i$ was processed for the $t$-th timestep (this happens w.p. $v^*_{i,t}$ from property (ii) of the induction hypothesis). Then from property (iii), the probability that this item completes at this timestep is exactly $\pi_{i,t}/\sum_{t' \ge t} \pi_{i,t'}$. Furthermore, it gets forcefully canceled in step 4 with probability $s^*_{i,t}/v^*_{i,t} - \pi_{i,t}/\sum_{t' \ge t} \pi_{i,t'}$. Thus the total probability of stopping at time $t$, assuming it has been processed for its $t$-th timestep, is exactly $s^*_{i,t}/v^*_{i,t}$; unconditionally, the probability of stopping at time $t$ is hence $s^*_{i,t}$.

Property (ii) follows as a consequence of Property (i), because the item is processed for its $(t+1)$-st timestep only if it did not stop at timestep $t$. Therefore, conditioned on being processed for the $t$-th timestep, it continues to be processed with probability $1 - s^*_{i,t}/v^*_{i,t}$. Therefore, removing the conditioning, we get that the probability of processing the item for its $(t+1)$-st timestep is $v^*_{i,t} - s^*_{i,t} = v^*_{i,t+1}$. Finally, for property (iii), conditioned on the fact that it has been processed for its $(t+1)$-st timestep, clearly the probability that its (unknown) size has instantiated to exactly $(t+1)$ is $\pi_{i,t+1}/\sum_{t' \ge t+1} \pi_{i,t'}$. When this happens, the job stops in step 7 of the algorithm.
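The induction can be replayed numerically for a single item. The sketch below is not the paper's algorithm but a direct transcription of the accounting in the proof: at each timestep the conditional stopping probability is the completion probability plus the forced-cancellation probability. The function name `stop_probabilities` and the list-based inputs `pi`, `v`, `s` (for $\pi_{i,\cdot}$, $v^*_{i,\cdot}$, $s^*_{i,\cdot}$) are hypothetical, and exact rational arithmetic is used so the properties can be checked with equality.

```python
from fractions import Fraction as F


def stop_probabilities(pi, v, s):
    """Return the unconditional probability of stopping at each timestep t.

    pi[t] = Pr[size = t]; v, s are LP values with v[t+1] = v[t] - s[t].
    The internal asserts mirror property (ii) and non-negativity of the
    forced-cancellation probability.
    """
    reach = F(1)  # probability the item is still being processed at step t
    stops = []
    for t in range(len(pi)):
        assert reach == v[t]  # property (ii)
        tail = sum(pi[t:])
        complete = pi[t] / tail if tail else F(0)  # Pr[size = t | size >= t]
        cancel = s[t] / v[t] - complete if v[t] else F(0)  # forced cancel
        assert cancel >= 0
        stop = complete + cancel  # equals s[t] / v[t]
        stops.append(reach * stop)  # property (i): equals s[t]
        reach -= reach * stop
    return stops
```

On any input satisfying the LP constraints the returned list coincides with `s`, which is exactly property (i).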



C.2 StocK with Small Sizes: A Fully Polytime Algorithm 



The idea is to quantize the possible sizes of the items in order to ensure that LP $(\mathsf{LP}_S)$ has polynomial size, then obtain a good strategy (via Algorithm StocK-Small) for the transformed instance, and finally to show that this strategy is actually almost as good for the original instance.

Consider an instance $\mathcal{I} = (\pi, R)$ where $R_{i,t} = 0$ for all $t > B/2$. Suppose we start scheduling an item at some time; instead of making decisions of whether to continue or cancel an item at each subsequent time step, we are going to do it in time steps which are powers of 2. To make this formal, define instance $\bar{\mathcal{I}} = (\bar{\pi}, \bar{R})$ as follows: set $\bar{\pi}_{i,2^j} = \sum_{t \in [2^j, 2^{j+1})} \pi_{i,t}$ and $\bar{R}_{i,2^j} = \big(\sum_{t \in [2^j, 2^{j+1})} \pi_{i,t} R_{i,t}\big)/\bar{\pi}_{i,2^j}$ for all $i \in [n]$ and $j \in \{0, 1, \ldots, \lfloor \log B \rfloor\}$. The instances are coupled in the natural way: the size of item $i$ in the instance $\bar{\mathcal{I}}$ is $2^j$ iff the size of item $i$ in the instance $\mathcal{I}$ lies in the interval $[2^j, 2^{j+1})$.
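The quantization step can be illustrated for a single item as follows. This is a minimal sketch under assumptions made here: `quantize_instance` is a hypothetical name, `pi[t]` and `R[t]` are lists indexed by $t = 0, \ldots, B$, and an empty bucket (zero mass) is given reward 0 by convention, a case the definition above leaves undefined.

```python
from fractions import Fraction


def quantize_instance(pi, R):
    """Group sizes into buckets [2^j, 2^(j+1)) for one item.

    Returns (pi_bar, R_bar), dictionaries keyed by the bucket's left end 2^j.
    R_bar is the conditional expected reward given that the size falls in
    the bucket; buckets of zero mass get reward 0 (a convention assumed here).
    """
    B = len(pi) - 1
    pi_bar, R_bar = {}, {}
    j = 0
    while 2 ** j <= B:
        bucket = range(2 ** j, min(2 ** (j + 1), B + 1))
        mass = sum(pi[t] for t in bucket)
        pi_bar[2 ** j] = mass
        R_bar[2 ** j] = sum(pi[t] * R[t] for t in bucket) / mass if mass else 0
        j += 1
    return pi_bar, R_bar
```

Because each $\bar{R}_{i,2^j}$ is a conditional average, the expected reward $\sum_t \pi_{i,t} R_{i,t}$ of the item is unchanged by the transformation.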



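In Section 3.1, a timestep of an item has duration of 1 time unit. However, due to the construction of $\bar{\mathcal{I}}$, it is useful to consider that the $t$-th time step of an item has duration $2^t$; thus, an item can only complete at its 0th, 1st, 2nd, etc. timesteps. With this in mind, we can write an LP analogous to $(\mathsf{LP}_S)$:

$$\begin{aligned}
\max\ & \sum_{1 \le j \le \log(B/2)}\ \sum_{1 \le i \le n} v_{i,2^j} \cdot \bar{R}_{i,2^j}\, \frac{\bar{\pi}_{i,2^j}}{\sum_{j' \ge j} \bar{\pi}_{i,2^{j'}}} & & & (\mathsf{PolyLP}_S) \\
& v_{i,2^j} = s_{i,2^j} + v_{i,2^{j+1}} & & \forall j \in [0, \log B],\ i \in [n] & \text{(C.26)} \\
& s_{i,2^j} \ge \frac{\bar{\pi}_{i,2^j}}{\sum_{j' \ge j} \bar{\pi}_{i,2^{j'}}} \cdot v_{i,2^j} & & \forall j \in [0, \log B],\ i \in [n] & \text{(C.27)} \\
& \sum_{i \in [n]}\ \sum_{j \in [0, \log B]} 2^j \cdot s_{i,2^j} \le B & & & \text{(C.28)} \\
& v_{i,1} = 1 & & \forall i \in [n] & \text{(C.29)} \\
& v_{i,2^j},\ s_{i,2^j} \in [0, 1] & & \forall j \in [0, \log B],\ i \in [n] & \text{(C.30)}
\end{aligned}$$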



Notice that this LP has size polynomial in the size of the instance $\mathcal{I}$.

Consider the LP $(\mathsf{LP}_S)$ with respect to the instance $\mathcal{I}$ and let $(v, s)$ be a feasible solution for it with objective value $z$. Then define $(\bar{v}, \bar{s})$ as follows: $\bar{v}_{i,2^j} = v_{i,2^j}$ and $\bar{s}_{i,2^j} = \sum_{t \in [2^j, 2^{j+1})} s_{i,t}$. It is easy to check that $(\bar{v}, \bar{s})$ is a feasible solution for $(\mathsf{PolyLP}_S)$ with value at least $z$, where the latter uses the fact that $v_{i,t}$ is non-increasing in $t$. Using Theorem 3.1 it then follows that the optimum of $(\mathsf{PolyLP}_S)$ with respect to $(\bar{\pi}, \bar{R})$ is at least as large as the reward obtained by the optimal solution for the stochastic knapsack instance $(\pi, R)$.

Let $(\bar{v}, \bar{s})$ denote an optimal solution of $(\mathsf{PolyLP}_S)$. Notice that with the redefined notion of timesteps we can naturally apply Algorithm StocK-Small to the LP solution $(\bar{v}, \bar{s})$. Moreover, Lemma 3.2 still holds in this setting. Finally, modify Algorithm StocK-Small by ignoring items with probability $1 - 1/8 = 7/8$ (instead of 3/4) in Step 2 (we abuse notation slightly and shall refer to the modified algorithm also as StocK-Small) and notice that Lemma 3.2 still holds.



Consider the strategy $\bar{\mathbb{S}}$ for $\bar{\mathcal{I}}$ obtained from Algorithm StocK-Small. We can obtain a strategy $\mathbb{S}$ for $\mathcal{I}$ as follows: whenever $\bar{\mathbb{S}}$ decides to process item $i$ of $\bar{\mathcal{I}}$ for its $j$-th timestep, we decide to continue item $i$ of $\mathcal{I}$ while it has size from $2^j$ to $2^{j+1} - 1$.

Lemma C.1 Strategy $\mathbb{S}$ is a 1/16 approximation for $\mathcal{I}$.

Proof. Consider an item $i$. Let $\bar{O}$ be the random variable denoting the total size occupied before strategy $\bar{\mathbb{S}}$ starts processing item $i$ and similarly let $O$ denote the total size occupied before strategy $\mathbb{S}$ starts processing item $i$. Since Lemma 3.2 still holds for the modified algorithm StocK-Small, we can proceed as in Theorem 3.3 and obtain that $\mathbb{E}[\bar{O}] \le B/8$. Due to the definition of $\mathbb{S}$ we can see that $O \le 2\bar{O}$ and hence $\mathbb{E}[O] \le B/4$. From Markov's inequality we obtain that $\Pr(O \ge B/2) \le 1/2$. Noticing that $i$ is started by $\mathbb{S}$ with probability 1/8 we get that the probability that $i$ is started and there is at least $B/2$ space left on the knapsack at this point is at least 1/16. Finally, notice that in this case $\bar{\mathbb{S}}$ and $\mathbb{S}$ obtain the same expected value from item $i$, namely $\sum_j \bar{v}_{i,2^j} \cdot \bar{R}_{i,2^j}\, \bar{\pi}_{i,2^j} / \sum_{j' \ge j} \bar{\pi}_{i,2^{j'}}$. Thus $\mathbb{S}$ gets expected value at least 1/16 of the optimum of $(\mathsf{PolyLP}_S)$, which is at least the value of the optimal solution for $\mathcal{I}$ as argued previously. ■
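The constants in this proof combine as follows; the chain below only restates the steps of the argument above, using the same symbols $O$ and $\bar{O}$.

```latex
\begin{align*}
\mathbb{E}[O] &\le 2\,\mathbb{E}[\bar{O}] \le 2 \cdot \frac{B}{8} = \frac{B}{4}, \\
\Pr\!\left[O \ge \tfrac{B}{2}\right] &\le \frac{\mathbb{E}[O]}{B/2} \le \frac{B/4}{B/2} = \frac{1}{2}
  && \text{(Markov's inequality)}, \\
\Pr\!\left[\text{$i$ is started and } O < \tfrac{B}{2}\right] &\ge \frac{1}{8}\cdot\left(1 - \frac{1}{2}\right) = \frac{1}{16}.
\end{align*}
```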

D Details from Section 4

D.1 Details of Phase I (from Section 4.2.1)



We first begin with some notation that will be useful in the algorithm below. For any state $u \in \mathcal{S}_i$ such that the path from $\rho_i$ to $u$ follows the states $u_1 = \rho_i, u_2, \ldots, u_k = u$, let $\pi_u = \prod_{l=1}^{k-1} p_{u_l, u_{l+1}}$.



Fix an arm $i$, for which we will perform the decomposition. Let $\{z, w\}$ be a feasible solution to $\mathsf{LP}_{\mathsf{mab}}$ and set $z^0_{u,t} = z_{u,t}$ and $w^0_{u,t} = w_{u,t}$ for all $u \in \mathcal{S}_i$, $t \in [B]$. We will gradually alter the fractional solution as we build the different forests. We note that in a particular iteration with index $j$, all $z^{j-1}, w^{j-1}$ values that are not updated in Steps 12 and 13 are retained in $z^j, w^j$, respectively. For brevity of notation, we shall use "iteration $j$ of step 2" to denote the execution of the entire block (steps 3 - 14) which constructs strategy forest $T(i, j)$.

Algorithm D.1 Convex Decomposition of Arm $i$

 1: set $\mathcal{C}_i \leftarrow \emptyset$ and set loop index $j \leftarrow 1$.
 2: while $\exists$ a node $u \in \mathcal{S}_i$ s.t. $\sum_t z^{j-1}_{u,t} > 0$ do
 3:    initialize a new tree $T(i, j) = \emptyset$.
 4:    set $A \leftarrow \{u \in \mathcal{S}_i \text{ s.t. } \sum_t z^{j-1}_{u,t} > 0\}$.
 5:    for all $u \in \mathcal{S}_i$, set time$(i, j, u) \leftarrow \infty$, prob$(i, j, u) \leftarrow 0$, and set $\epsilon_u \leftarrow \infty$.
 6:    for every $u \in A$ do
 7:       update time$(i, j, u)$ to the smallest time $t$ s.t. $z^{j-1}_{u,t} > 0$.
 8:       update $\epsilon_u = z^{j-1}_{u,\mathrm{time}(i,j,u)} / \pi_u$
 9:    let $\epsilon = \min_u \epsilon_u$.
10:    for every $u \in A$ do
11:       set prob$(i, j, u) = \epsilon \cdot \pi_u$.
12:       update $z^{j}_{u,\mathrm{time}(i,j,u)} = z^{j-1}_{u,\mathrm{time}(i,j,u)} - \mathrm{prob}(i, j, u)$.
13:       update $w^{j}_{v,\mathrm{time}(i,j,u)+1} = w^{j-1}_{v,\mathrm{time}(i,j,u)+1} - \mathrm{prob}(i, j, u) \cdot p_{u,v}$ for all $v$ s.t. parent$(v) = u$.
14:    set $\mathcal{C}_i \leftarrow \mathcal{C}_i \cup T(i, j)$.
15:    increment $j \leftarrow j + 1$.

Lemma D.1 Consider an integer $j$ and suppose that $\{z^{j-1}, w^{j-1}\}$ satisfies constraints (4.10)-(4.12) of $\mathsf{LP}_{\mathsf{mab}}$. Then after iteration $j$ of Step 2, the following properties hold:

(a) $T(i, j)$ (along with the associated prob$(i, j, \cdot)$ and time$(i, j, \cdot)$ values) is a valid strategy forest, i.e., satisfies the conditions (i) and (ii) presented in Section 4.2.1.

(b) The residual solution $\{z^j, w^j\}$ satisfies constraints (4.10)-(4.12).

(c) For any time $t$ and state $u \in \mathcal{S}_i$, $z^{j-1}_{u,t} - z^{j}_{u,t} = \mathrm{prob}(i, j, u)\, \mathbf{1}_{\mathrm{time}(i,j,u) = t}$.



Proof. We show the properties stated above one by one.

Property (a): We first show that the time values satisfy time$(i, j, u) \ge$ time$(i, j, \mathrm{parent}(u)) + 1$, i.e. condition (i) of strategy forests. For sake of contradiction, assume that there exists $u \in \mathcal{S}_i$ with $v = \mathrm{parent}(u)$ where time$(i, j, u) \le$ time$(i, j, v)$. Define $t_u = \mathrm{time}(i, j, u)$ and $t_v = \mathrm{time}(i, j, \mathrm{parent}(u))$; the way we updated time$(i, j, u)$ in step 7 gives that $z^{j-1}_{u,t_u} > 0$.

Then, constraint (4.11) of the LP implies that $\sum_{t' \le t_u} w^{j-1}_{u,t'} > 0$. In particular, there exists a time $t' \le t_u \le t_v$ such that $w^{j-1}_{u,t'} > 0$. But now, constraint (4.10) enforces that $z^{j-1}_{v,t'-1} = w^{j-1}_{u,t'}/p_{v,u} > 0$ as well. But this contradicts the fact that $t_v$ was the first time s.t. $z^{j-1}_{v,t} > 0$. Hence we have time$(i, j, u) \ge$ time$(i, j, \mathrm{parent}(u)) + 1$.

As for condition (ii) about prob$(i, j, \cdot)$, notice that if time$(i, j, u) \ne \infty$, then prob$(i, j, u)$ is set to $\epsilon \cdot \pi_u$ in step 11. It is now easy to see from the definition of $\pi_u$ (and from the fact that time$(i, j, u) \ne \infty \Rightarrow$ time$(i, j, \mathrm{parent}(u)) \ne \infty$) that prob$(i, j, u) = \mathrm{prob}(i, j, \mathrm{parent}(u)) \cdot p_{\mathrm{parent}(u),u}$.

Property (b): Constraint (4.10) of $\mathsf{LP}_{\mathsf{mab}}$ is clearly satisfied by the new LP solution $\{z^j, w^j\}$ because of the two updates performed in Steps 12 and 13: if we decrease the $z$ value of any node at any time, the $w$ of all children are appropriately reduced (for the subsequent timestep).

Before showing that the solution $\{z^j, w^j\}$ satisfies constraint (4.11), we first argue that they remain non-negative. By the choice of $\epsilon$ in step 9, we have prob$(i, j, u) = \epsilon \pi_u \le \epsilon_u \pi_u = z^{j-1}_{u,\mathrm{time}(i,j,u)}$ (where $\epsilon_u$ was computed in Step 8); consequently even after the update in step 12, $z^{j}_{u,\mathrm{time}(i,j,u)} \ge 0$ for all $u$. This and the fact that the constraints (4.10) are satisfied implies that $\{z^j, w^j\}$ satisfies the non-negativity requirement.

We now show that constraint (4.11) is satisfied. For any time $t$ and state $u \notin A$ (where $A$ is the set computed in step 4 for iteration $j$), clearly it must be that $\sum_{t' \le t} z^{j}_{u,t'} = \sum_{t' \le t} z^{j-1}_{u,t'} = 0$ by definition of the set $A$; hence just the non-negativity of $w^{j}$ implies that these constraints are trivially satisfied.

Therefore consider some $t \in [B]$ and a state $u \in A$. We know from step 7 that time$(i, j, u) \ne \infty$. If $t <$ time$(i, j, u)$, then the way time$(i, j, u)$ is updated in step 7 implies that $\sum_{t' \le t} z^{j}_{u,t'} = \sum_{t' \le t} z^{j-1}_{u,t'} = 0$, so the constraint is trivially satisfied because $w^j$ is non-negative. If $t \ge$ time$(i, j, u)$, we claim that the change in the left hand side and right hand side (between the solutions $\{z^{j-1}, w^{j-1}\}$ and $\{z^j, w^j\}$) of the constraint under consideration is the same, implying that it will be still satisfied by $\{z^j, w^j\}$.

To prove this claim, observe that the right hand side has decreased by exactly $z^{j-1}_{u,\mathrm{time}(i,j,u)} - z^{j}_{u,\mathrm{time}(i,j,u)} = \mathrm{prob}(i, j, u)$. But the only value which has been modified in the left hand side is $w^{j-1}_{u,\mathrm{time}(i,j,\mathrm{parent}(u))+1}$, which has gone down by prob$(i, j, \mathrm{parent}(u)) \cdot p_{\mathrm{parent}(u),u}$. Because $T(i, j)$ forms a valid strategy forest, we have prob$(i, j, u) = \mathrm{prob}(i, j, \mathrm{parent}(u)) \cdot p_{\mathrm{parent}(u),u}$ and thus the claim follows.

Finally, constraints (4.12) are also satisfied as the $z$ variables only decrease in value over iterations.

Property (c): This is an immediate consequence of Step 12. ■



To prove Lemma 4.2, firstly notice that since $\{z^0, w^0\}$ satisfies constraints (4.10)-(4.12), we can proceed by induction and infer that the properties in the previous lemma hold for every strategy forest in the decomposition; in particular, each of them is a valid strategy forest.

In order to show that the marginals are preserved, observe that in the last iteration $j^*$ of the procedure we have $z^{j^*}_{u,t} = 0$ for all $u, t$. Therefore, adding the last property in the previous lemma over all $j$ gives

$$z_{u,t} = \sum_{j \ge 1} \big(z^{j-1}_{u,t} - z^{j}_{u,t}\big) = \sum_{j \ge 1} \mathrm{prob}(i, j, u)\, \mathbf{1}_{\mathrm{time}(i,j,u) = t} = \sum_{j:\, \mathrm{time}(i,j,u) = t} \mathrm{prob}(i, j, u).$$

Finally, since some $z^{j}_{u,t}$ gets altered to 0 in each iteration of the above algorithm, the number of strategies for each arm in the decomposition is upper bounded by $B|\mathcal{S}|$. This completes the proof of Lemma 4.2.
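The peeling procedure of Algorithm D.1 and the marginal-preservation property can be tried out on a small example with the Python sketch below. It is an illustration under simplifying assumptions, not the paper's implementation: the name `convex_decomposition` and the dictionary-based inputs are hypothetical, only the $z$ values are tracked (the $w$ updates of step 13 serve the feasibility argument and are omitted), and states absent from a returned forest are understood to have time $= \infty$ and prob $= 0$.

```python
from fractions import Fraction as F


def convex_decomposition(parent, p, z):
    """Peel strategy forests off a fractional solution for one arm.

    parent[u]: parent state of u (None for the root).
    p[(v, u)]: transition probability from v to its child u.
    z[u][t]:   fractional extent to which state u is played at time t.
    Returns a list of forests, each mapping u -> (time, prob).
    """
    def pi(u):
        # product of transition probabilities on the root-to-u path
        out = F(1)
        while parent[u] is not None:
            out *= p[(parent[u], u)]
            u = parent[u]
        return out

    residual = {u: dict(mass) for u, mass in z.items()}  # do not mutate z
    forests = []
    while True:
        # step 4: states that still carry positive mass
        active = [u for u in residual if any(m > 0 for m in residual[u].values())]
        if not active:
            return forests
        # step 7: earliest time with positive residual mass
        first = {u: min(t for t, m in residual[u].items() if m > 0) for u in active}
        # steps 8-9: largest epsilon keeping all residuals non-negative
        eps = min(residual[u][first[u]] / pi(u) for u in active)
        forest = {}
        for u in active:
            prob = eps * pi(u)            # step 11
            residual[u][first[u]] -= prob  # step 12
            forest[u] = (first[u], prob)
        forests.append(forest)
```

Each iteration zeroes at least one residual entry, so the loop terminates after at most as many rounds as there are positive entries of `z`, matching the $B|\mathcal{S}|$ bound above.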
D.2 Details of Phase II (from Section 4.2.2)



Proof of Lemma 4.4: Let $\mathrm{time}^t(u)$ denote the time assigned to node $u$ by the end of round $\tau = t$ of the algorithm; $\mathrm{time}^{B+1}(u)$ is the initial time of $u$. Since the algorithm works backwards in time, our round index will start at $B$ and end up at 1. To prove property (i) of the statement of the lemma, notice that the algorithm only converts head nodes to non-head nodes and not the other way around. Moreover, heads which survive the algorithm have the same time as originally. So it suffices to show that heads which originally did not satisfy property (i), namely those with $\mathrm{time}^{B+1}(v) < 2 \cdot \mathrm{depth}(v)$, do not survive the algorithm; but this is clear from the definition of Step 2.

To prove property (ii), fix a time $t$, and consider the execution of GapFill at the end of round $\tau = t$. We claim that the total extent of fractional play at time $t$ does not increase as we continue the execution of the algorithm from round $\tau = t$ to round 1. To see why, let $C$ be a connected component at the end of round $\tau = t$ and let $h$ denote its head. If $\mathrm{time}^t(h) > t$ then no further advance affects $C$ and hence it does not contribute to an increase in the number of plays at time $t$. On the other hand, if $\mathrm{time}^t(h) \le t$, then even if $C$ is advanced in a subsequent round, each node $w$ of $C$ which ends up being played at $t$, i.e., has $\mathrm{time}^1(w) = t$, must have an ancestor $w'$ satisfying $\mathrm{time}^t(w') = t$, by the contiguity of $C$. Thus, Observation 4.3 gives that $\sum_{u \in C:\, \mathrm{time}^1(u) = t} \mathrm{prob}(u) \le \sum_{u \in C:\, \mathrm{time}^t(u) = t} \mathrm{prob}(u)$. Applying this for each connected component $C$ proves the claim. Intuitively, any component which advances forward in time is only reducing its load/total fractional play at any fixed time $t$.



Figure D.5: Depiction of a strategy forest $T(i, j)$ on a timeline, where each triangle is a connected component. (a) Connected components in the beginning of the algorithm. (b) Configuration at the end of iteration $\tau = t$. In this example, $H = \{h_2, h_5\}$ and $C_{h_2}$ consists of the grey nodes. From Observation 4.3 the number of plays at $t$ do not increase as components are moved to the left.

Then consider the end of iteration $\tau = t$ and we now prove that the fractional extent of play at time $t$ is at most 3. Due to Lemma 4.2, it suffices to prove that $\sum_{u \in U} \mathrm{prob}(u) \le 2$, where $U$ is the set of nodes which caused an increase in the number of plays at time $t$, namely, $U = \{u : \mathrm{time}^{B+1}(u) > t \text{ and } \mathrm{time}^t(u) = t\}$.

Notice that a connected component of the original forest can only contribute to this increase if its head $h$ crossed time $t$, that is $\mathrm{time}^{B+1}(h) > t$ and $\mathrm{time}^t(h) \le t$. However, it may be that this crossing was not directly caused by an advance on $h$ (i.e. $h$ advanced till $\mathrm{time}^{B+1}(\mathrm{parent}(h)) \ge t$), but an advance to a head $h'$ in a subsequent round was responsible for $h$ crossing over $t$. But in this case $h$ must be part of the connected component of $h'$ when the latter advance happens, and we can use the advance of $h'$ to bound the congestion.

To make this more formal, let $H$ be the set of heads of the original forest whose advances made them cross time $t$, namely, $h \in H$ iff $\mathrm{time}^{B+1}(h) > t$, $\mathrm{time}^t(h) \le t$ and $\mathrm{time}^{B+1}(\mathrm{parent}(h)) < t$. Moreover, for $h \in H$ let $C_h$ denote the connected component of $h$ in the beginning of the iteration where an advance was executed on $h$, that is, when $v$ was set to $h$ in Step 3. The above argument shows that these components $C_h$ contain all the nodes in $U$, hence it suffices to see how they increase the congestion at time $t$.

In fact, it is sufficient to focus just on the heads in $H$. To see this, consider $h \in H$ and notice that no node in $U \cap C_h$ is an ancestor of another. Then Observation 4.3 gives $\sum_{u \in U \cap C_h} \mathrm{prob}(u) \le \mathrm{prob}(h)$, and adding over all $h$ in $H$ gives $\sum_{u \in U} \mathrm{prob}(u) \le \sum_{h \in H} \mathrm{prob}(h)$.

To conclude the proof, we upper bound the right hand side of the previous inequality. The idea now is that the play probabilities on the nodes in $H$ cannot be too large since their parents have $\mathrm{time}^{B+1} < t$ (and each head has a large number of ancestors in $[1, t]$ because it was considered for an advance). More formally, fix $i, j$ and consider a head $h$ in $H \cap T(i, j)$. From Step 2 of the algorithm, we obtain that $\mathrm{depth}(h) > (1/2)\,\mathrm{time}^{B+1}(h) > t/2$. Since $\mathrm{time}^{B+1}(\mathrm{parent}(h)) < t$, it follows that for every $d \le \lfloor t/2 \rfloor$, $h$ has an ancestor $u \in T(i, j)$ with $\mathrm{depth}(u) = d$ and $\mathrm{time}^{B+1}(u) < t$. Moreover, the definition of $H$ implies that no head in $H \cap T(i, j)$ can be an ancestor of another. Then again employing Observation 4.3 we obtain

$$\sum_{h \in H \cap T(i,j)} \mathrm{prob}(h) \le \sum_{u \in T(i,j):\ \mathrm{depth}(u) = d,\ \mathrm{time}^{B+1}(u) < t} \mathrm{prob}(u) \qquad (\forall d \le \lfloor t/2 \rfloor).$$

Adding over all $i, j$ and $d \le \lfloor t/2 \rfloor$ leads to the bound $(t/2) \cdot \sum_{h \in H} \mathrm{prob}(h) \le \sum_{u:\, \mathrm{time}^{B+1}(u) < t} \mathrm{prob}(u)$. Finally, using Lemma 4.2 we can upper bound the right hand side by $t$, which gives $\sum_{u \in U} \mathrm{prob}(u) \le \sum_{h \in H} \mathrm{prob}(h) \le 2$ as desired. ■



D.3 Details of Phase III (from Section 4.2.3 ) 



Proof of Lemma 4.5: The proof is quite straightforward. Intuitively, it is because AlgMAB (Algorithm 12) simply follows the probabilities according to the transition tree $T_i$ (unless time$(i, j, u) = \infty$ in which case it abandons the arm). Consider an arm $i$ such that $\sigma(i) = j$, and any state $u \in \mathcal{S}_i$. Let $\langle v_1 = \rho_i, v_2, \ldots, v_t = u \rangle$ denote the unique path in the transition tree for arm $i$ from $\rho_i$ to $u$. Then, if time$(i, j, u) \ne \infty$ the probability that state $u$ is played is exactly the probability of the transitions reaching $u$ (because in steps 8 and 9, the algorithm just keeps playing the states$^7$ and making the transitions, unless time$(i, j, u) = \infty$). But this is precisely $\prod_{k=1}^{t-1} p_{v_k, v_{k+1}} = \mathrm{prob}(i, j, u)/\mathrm{prob}(i, j, \rho_i)$ (from the properties of each strategy in the convex decomposition).

If time$(i, j, u) = \infty$ however, then the algorithm terminates the arm in Step 10 without playing $u$, and so the probability of playing $u$ is $0 = \mathrm{prob}(i, j, u)/\mathrm{prob}(i, j, \rho_i)$. This completes the proof.

E Proofs from Section g 

E.l Layered DAGs capture all Graphs 

We first show that layered DAGs can capture all transition graphs, with a blow-up of a factor of $B$ in the state space. For each arm $i$, for each state $u$ in the transition graph $\mathcal{S}_i$, create $B$ copies of it indexed by $(u, t)$ for all $1 \le t \le B$. Then for each $u$ and $v$ such that $p_{u,v} > 0$ and for each $1 \le t < B$, place an arc $(u, t) \to (v, t+1)$. Finally, delete all vertices that are not reachable from the state $(\rho_i, 1)$ where $\rho_i$ is the starting state of arm $i$. There is a clear correspondence between the transitions in $\mathcal{S}_i$ and the ones in this layered graph: whenever state $u$ is played at time $t$ and $\mathcal{S}_i$ transitions to state $v$, we have the transition from $(u, t)$ to $(v, t+1)$ in the layered DAG. Henceforth, we shall assume that the layered graph created in this manner is the transition graph for each arm.
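The layering construction is easy to sketch in Python. The function name `layered_dag` and the dictionary representation of the transition probabilities are hypothetical; also, instead of creating all $B$ copies and then deleting unreachable vertices, the sketch explores only the copies reachable from $(\rho_i, 1)$ by breadth-first search, which yields the same graph.

```python
from collections import deque


def layered_dag(p, root, B):
    """Unroll a transition graph into B time layers.

    p[(u, v)]: transition probability from state u to state v.
    Returns (nodes, arcs): the reachable copies (u, t) and a dictionary
    mapping each arc ((u, t), (v, t + 1)) to its transition probability.
    """
    successors = {}
    for (u, v), prob in p.items():
        if prob > 0:
            successors.setdefault(u, []).append((v, prob))

    nodes = {(root, 1)}
    arcs = {}
    queue = deque([(root, 1)])
    while queue:
        u, t = queue.popleft()
        if t == B:  # last layer: no outgoing arcs
            continue
        for v, prob in successors.get(u, []):
            arcs[((u, t), (v, t + 1))] = prob
            if (v, t + 1) not in nodes:
                nodes.add((v, t + 1))
                queue.append((v, t + 1))
    return nodes, arcs
```

Every arc goes from layer $t$ to layer $t+1$, so the result is acyclic even when the original transition graph has cycles.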



7 We remark that while the plays just follow the transition probabilities, they may not be made contiguously.

F MABs with Budgeted Exploitation



As we remarked before, we now explain how to generalize the argument from Section 4 to the presence of "exploits". A strategy in this model needs to choose an arm in each time step and perform one of two actions:
either it pulls the arm, which makes it transition to another state (this corresponds to playing in the previous 
model), or exploits it. If an arm is in state u and is exploited, it fetches reward r u , and cannot be pulled any more. 
As in the previous case, there is a budget B on the total number of pulls that a strategy can make and an additional 
budget of K on the total number of exploits allowed. (We remark that the same analysis handles the case when 
pulling an arm also fetches reward, but for a clearer presentation we do not consider such rewards here.) 



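Our algorithm in Section 4 can be, for the large part, directly applied in this situation as well; we now explain the small changes that need to be done in the various steps, beginning with the new LP relaxation. The additional variable in the LP, denoted by $x_{u,t}$ (for $u \in \mathcal{S}_i$, $t \in [B]$), corresponds to the probability of exploiting state $u$ at time $t$.

$$\begin{aligned}
\max\ & \sum_{u,t} r_u \cdot x_{u,t} & & & (\mathsf{LP4}) \\
& w_{u,t} = z_{\mathrm{parent}(u),t-1} \cdot p_{\mathrm{parent}(u),u} & & \forall t \in [2, B],\ u \in \mathcal{S} & \text{(F.31)} \\
& \sum_{t' \le t} w_{u,t'} \ge \sum_{t' \le t} (z_{u,t'} + x_{u,t'}) & & \forall t \in [1, B],\ u \in \mathcal{S} & \text{(F.32)} \\
& \sum_{u \in \mathcal{S}} z_{u,t} \le 1 & & \forall t \in [1, B] & \text{(F.33)} \\
& \sum_{u \in \mathcal{S},\, t \in [B]} x_{u,t} \le K & & & \text{(F.34)} \\
& w_{\rho_i,1} = 1 & & \forall i \in [1, n] & \text{(F.35)}
\end{aligned}$$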

F.1 Changes to the Algorithm
Phase I: Convex Decomposition 

This is the step where most of the changes happen, to incorporate the notion of exploitation. For an arm $i$, its strategy forest $\mathrm{x}T(i, j)$ (the "x" to emphasize the "exploit") is an assignment of values time$(i, j, u)$, pull$(i, j, u)$ and exploit$(i, j, u)$ to each state $u \in \mathcal{S}_i$ such that:

(i) For $u \in \mathcal{S}_i$ and $v = \mathrm{parent}(u)$, it holds that time$(i, j, u) \ge 1 +$ time$(i, j, v)$, and

(ii) For $u \in \mathcal{S}_i$ and $v = \mathrm{parent}(u)$ s.t. time$(i, j, u) \ne \infty$, then one of pull$(i, j, u)$ or exploit$(i, j, u)$ is equal to $p_{v,u} \cdot \mathrm{pull}(i, j, v)$ and the other is 0; if time$(i, j, u) = \infty$ then pull$(i, j, u) =$ exploit$(i, j, u) = 0$.

For any state $u$, the value time$(i, j, u)$ denotes the time at which arm $i$ is played (i.e., pulled or exploited) at state $u$, and pull$(i, j, u)$ (resp. exploit$(i, j, u)$) denotes the probability that the state $u$ is pulled (resp. exploited). With the new definition, if time$(i, j, u) = \infty$ then this strategy does not play the arm at $u$. If state $u$ satisfies exploit$(i, j, u) \ne 0$, then strategy $\mathrm{x}T(i, j)$ always exploits $u$ upon reaching it and hence none of its descendants can be reached. For states $u$ which have time$(i, j, u) \ne \infty$ and have exploit$(i, j, u) = 0$, this strategy always pulls $u$ upon reaching it. In essence, if time$(i, j, u) \ne \infty$, either pull$(i, j, u) = \mathrm{pull}(i, j, \rho_i) \cdot \pi_u$, or exploit$(i, j, u) = \mathrm{pull}(i, j, \rho_i) \cdot \pi_u$.

Furthermore, these strategy forests are such that the following are also true.

(i) $\sum_{j \text{ s.t. } \mathrm{time}(i,j,u) = t} \mathrm{pull}(i, j, u) = z_{u,t}$,

(ii) $\sum_{j \text{ s.t. } \mathrm{time}(i,j,u) = t} \mathrm{exploit}(i, j, u) = x_{u,t}$.

For convenience, let us define prob$(i, j, u) =$ pull$(i, j, u) +$ exploit$(i, j, u)$, which denotes the probability of some play happening at $u$.



The algorithm to construct such a decomposition is very similar to the one presented in Section D.1. The only change is that in Step 7 of Algorithm D.1, instead of looking at the first time when $z_{u,t} > 0$, we look at the first time when either $z_{u,t} > 0$ or $x_{u,t} > 0$. If $x_{u,t} > 0$, we ignore all of $u$'s descendants in the current forest we plan to peel off. Once we have such a collection, we again appropriately select the largest $\epsilon$ which preserves non-negativity of the $x$'s and $z$'s. Finally, we update the fractional solution to preserve feasibility. The same analysis can be used to prove the analogue of Lemma D.1 for this case, which in turn gives the desired properties for the strategy forests.

Phase II: Eliminating Small Gaps 



This is identical to Section 4.2.2.



Phase III: Scheduling the Arms 



The algorithm is also identical to that in Section 4.2.3. We sample a strategy forest xT(i,j) for each arm i and 
simply play connected components contiguously. Each time we finish playing a connected component, we play 
the next component that begins earliest in the LP. The only difference is that a play may now be either a pull 
or an exploit (which is deterministically determined once we fix a strategy forest); if this play is an exploit, the 
arm does not proceed to other states and is dropped. Again we let the algorithm run ignoring the pull and exploit 
budgets, but in the analysis we only collect reward from exploits which happen before either budget is exceeded. 

The lower bound on the expected reward collected is again very similar to the previous model; the only change is to the statement of Lemma 4.6, which now becomes the following.



Lemma F.1 For arm $i$ and strategy $\mathrm{x}T(i, j)$, suppose arm $i$ samples strategy $j$ in step 1 of AlgMAB (i.e., $\sigma(i) = j$). Given that the algorithm plays the arm $i$ in state $u$ during this run, the probability that this play happens before time time$(i, j, u)$ and the number of exploits before this play is smaller than $K$, is at least 11/24.

In Section 4, we showed Lemma 4.6 by showing that

$$\Pr[\tau_u > \mathrm{time}(i, j, u) \mid \mathcal{E}_{iju}] \le \tfrac{1}{2}.$$

Additionally, suppose we can also show that

$$\Pr[\text{number of exploits before } u > (K - 1) \mid \mathcal{E}_{iju}] \le \tfrac{1}{24}. \qquad \text{(F.36)}$$

Then we would have

$$\Pr[(\text{number of exploits before } u > (K - 1)) \vee (\tau_u > \mathrm{time}(i, j, u)) \mid \mathcal{E}_{iju}] \le 13/24,$$

which would imply the Lemma.

To show Equation (F.36) we start with an analog of Lemma 4.5 for bounding arm exploitations: conditioned on $\mathcal{E}_{iju}$ and $\sigma(i') = j'$, the probability that arm $i'$ is exploited at state $u'$ before $u$ is exploited is at most exploit$(i', j', u')/\mathrm{prob}(i', j', \rho_{i'})$. This holds even when $i' = i$: in this case the probability of arm $i$ being exploited before reaching $u$ is zero, since an arm is abandoned after its first exploit. Since $\sigma(i') = j'$ with probability prob$(i', j', \rho_{i'})/24$, it follows that the probability of exploiting arm $i'$ in state $u'$ conditioned on $\mathcal{E}_{iju}$ is at most exploit$(i', j', u')/24$. By linearity of expectation, the expected number of exploits before $u$ conditioned on $\mathcal{E}_{iju}$ is at most $\sum_{(i', j', u')} \mathrm{exploit}(i', j', u')/24 = \sum_{u', t} x_{u',t}/24$, which is upper bounded by $K/24$ due to LP feasibility. Then Equation (F.36) follows from Markov's inequality.
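For reference, the two probabilistic steps used here (Markov's inequality followed by a union bound) can be written out as below; the symbol $N$ for the number of exploits before $u$ is shorthand introduced only for this display.

```latex
\begin{align*}
\Pr[N \ge K \mid \mathcal{E}_{iju}]
  &\le \frac{\mathbb{E}[N \mid \mathcal{E}_{iju}]}{K}
   \le \frac{K/24}{K} = \frac{1}{24}, \\
\Pr[(N \ge K) \vee (\tau_u > \mathrm{time}(i,j,u)) \mid \mathcal{E}_{iju}]
  &\le \frac{1}{24} + \frac{1}{2} = \frac{13}{24}, \\
\Pr[(N < K) \wedge (\tau_u \le \mathrm{time}(i,j,u)) \mid \mathcal{E}_{iju}]
  &\ge 1 - \frac{13}{24} = \frac{11}{24}.
\end{align*}
```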



The rest of the argument is identical to that in Section 4, giving us the following.

Theorem F.2 There is a randomized $O(1)$-approximation algorithm for the MAB problem with an exploration budget of $B$ and an exploitation budget of $K$.